* Hash collision count
@ 2005-04-23 20:27 Jeff Garzik
  2005-04-23 20:33 ` Jeff Garzik
  ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Jeff Garzik @ 2005-04-23 20:27 UTC (permalink / raw)
To: Git Mailing List, Linus Torvalds

Ideally a hash + collision-count pair would make the best key, rather
than just hash alone.

A collision -will- occur eventually, and it is trivial to avoid this
problem:

	$n = 0
	attempt to store as $hash-$n
	if $hash-$n exists (unlikely)
		$n++
		goto restart
	key = $hash-$n

Tangent-as-the-reason-I-bring-this-up:

One of my long-term projects is an archive service, somewhat like
Plan9's venti: a multi-server key-value database, with sha1 hash as the
key.

However, as the database grows into the terabyte (possibly petabyte)
range, the likelihood of a collision transitions rapidly from unlikely
-> possible -> likely.

Since it is -so- simple to guarantee that you avoid collisions, I'm
hoping git will do so before the key structure is too ingrained.

	Jeff

^ permalink raw reply	[flat|nested] 14+ messages in thread
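For illustration, here is a minimal sketch (in Python) of the hash-plus-collision-count keying Jeff proposes. The `CollisionCountStore` class and the tiny `weak` hash below are hypothetical stand-ins for demonstration, not git's or venti's actual storage code; the hash function is pluggable so that a collision can actually be triggered at toy scale.

```python
import hashlib


class CollisionCountStore:
    """Key-value store keyed by "<hash>-<n>", where n is a collision counter.

    The hash function is pluggable so a collision can be simulated in a
    test; a real store would use SHA-1 over the content.
    """

    def __init__(self, hash_fn=None):
        self.hash_fn = hash_fn or (lambda data: hashlib.sha1(data).hexdigest())
        self.objects = {}  # key string -> content bytes

    def store(self, data):
        """Store data and return its key.

        Identical content reuses its existing key (a reference); distinct
        content with the same hash gets the next counter value.
        """
        h = self.hash_fn(data)
        n = 0
        while True:
            key = f"{h}-{n}"
            existing = self.objects.get(key)
            if existing is None:
                self.objects[key] = data
                return key
            if existing == data:   # same content: a reference, not a collision
                return key
            n += 1                 # true collision: bump the counter and retry


# Simulate collisions with a deliberately weak 8-bit hash.
weak = lambda data: format(sum(data) % 256, "02x")
store = CollisionCountStore(hash_fn=weak)
```

Note that, as Jeff concedes in his follow-up, telling "reference" apart from "collision" forces a content comparison against every `$hash-$n` already on disk.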
* Re: Hash collision count
  2005-04-23 20:27 Hash collision count Jeff Garzik
@ 2005-04-23 20:33 ` Jeff Garzik
  2005-04-23 23:00 ` Ray Heasman
  2005-04-24  7:56 ` David Lang
  2 siblings, 0 replies; 14+ messages in thread
From: Jeff Garzik @ 2005-04-23 20:33 UTC (permalink / raw)
To: Git Mailing List, Linus Torvalds

Jeff Garzik wrote:
> Ideally a hash + collision-count pair would make the best key, rather
> than just hash alone.
>
> A collision -will- occur eventually, and it is trivial to avoid this
> problem:
>
> 	$n = 0
> 	attempt to store as $hash-$n
> 	if $hash-$n exists (unlikely)
> 		$n++
> 		goto restart
> 	key = $hash-$n

This of course presumes that you know that your new data does not exist
in the cache. Otherwise, you would need to hash the on-disk $hash-0 to
determine if it's a collision or a reference.

	Jeff
* Re: Hash collision count
  2005-04-23 20:27 Hash collision count Jeff Garzik
  2005-04-23 20:33 ` Jeff Garzik
@ 2005-04-23 23:00 ` Ray Heasman
  2005-04-23 23:20   ` Jeff Garzik
  2005-04-24  7:56 ` David Lang
  2 siblings, 1 reply; 14+ messages in thread
From: Ray Heasman @ 2005-04-23 23:00 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Git Mailing List, Linus Torvalds

On Sat, 2005-04-23 at 16:27 -0400, Jeff Garzik wrote:
> Ideally a hash + collision-count pair would make the best key, rather
> than just hash alone.
>
> A collision -will- occur eventually, and it is trivial to avoid this
> problem:
>
> 	$n = 0
> 	attempt to store as $hash-$n
> 	if $hash-$n exists (unlikely)
> 		$n++
> 		goto restart
> 	key = $hash-$n

Great. So what have you done here? Suppose you have 32 bits of counter
for n. Whoopee, you just added 32 bits to your hash, using a two-stage
algorithm. So, you have a 192-bit hash, assuming you started with the
160-bit SHA. And, one day your 32-bit counter won't be enough. Then
what?

> Tangent-as-the-reason-I-bring-this-up:
>
> One of my long-term projects is an archive service, somewhat like
> Plan9's venti: a multi-server key-value database, with sha1 hash as the
> key.
>
> However, as the database grows into the terabyte (possibly petabyte)
> range, the likelihood of a collision transitions rapidly from unlikely
> -> possible -> likely.
>
> Since it is -so- simple to guarantee that you avoid collisions, I'm
> hoping git will do so before the key structure is too ingrained.

You aren't solving anything. You're just putting it off, and doing it in
a way that breaks all the wonderful semantics possible by just assuming
that the hash is unique. All of a sudden we are doing checks of data
that we never did before, and we have to do the check trillions of times
before the CPU time spent pays off. If you want to use a bigger hash
then use a bigger hash, but don't fool yourself into thinking that isn't
what you are doing.

Ciao,
Ray
* Re: Hash collision count
  2005-04-23 23:00 ` Ray Heasman
@ 2005-04-23 23:20 ` Jeff Garzik
  2005-04-23 23:46   ` Petr Baudis
  2005-04-24  1:01   ` Ray Heasman
  0 siblings, 2 replies; 14+ messages in thread
From: Jeff Garzik @ 2005-04-23 23:20 UTC (permalink / raw)
To: Ray Heasman; +Cc: Git Mailing List, Linus Torvalds

Ray Heasman wrote:
> On Sat, 2005-04-23 at 16:27 -0400, Jeff Garzik wrote:
>
>> Ideally a hash + collision-count pair would make the best key, rather
>> than just hash alone.
>>
>> A collision -will- occur eventually, and it is trivial to avoid this
>> problem:
>>
>> 	$n = 0
>> 	attempt to store as $hash-$n
>> 	if $hash-$n exists (unlikely)
>> 		$n++
>> 		goto restart
>> 	key = $hash-$n
>
> Great. So what have you done here? Suppose you have 32 bits of counter
> for n. Whoopee, you just added 32 bits to your hash, using a two stage
> algorithm. So, you have a 192 bit hash assuming you started with the 160
> bit SHA. And, one day your 32 bit counter won't be enough. Then what?

First, there is no 32-bit limit. git stores keys (aka hashes) as
strings. As it should.

Second, in your scenario, it's highly unlikely you would get 4 billion
sha1 hash collisions, even if you had the disk space to store such a git
database.

>> Tangent-as-the-reason-I-bring-this-up:
>>
>> One of my long-term projects is an archive service, somewhat like
>> Plan9's venti: a multi-server key-value database, with sha1 hash as the
>> key.
>>
>> However, as the database grows into the terabyte (possibly petabyte)
>> range, the likelihood of a collision transitions rapidly from unlikely
>> -> possible -> likely.
>>
>> Since it is -so- simple to guarantee that you avoid collisions, I'm
>> hoping git will do so before the key structure is too ingrained.
>
> You aren't solving anything. You're just putting it off, and doing it in
> a way that breaks all the wonderful semantics possible by just assuming
> that the hash is unique. All of a sudden we are doing checks of data
> that we never did before, and we have to do the check trillions of times
> before the CPU time spent pays off.

First, the hash is NOT unique.

Second, you lose data if you pretend it is unique. I don't like losing
data.

Third, a data check only occurs in the highly unlikely case that a hash
already exists -- a collision. Rather than "trillions of times", more
like "one in a trillion chance."

	Jeff
* Re: Hash collision count
  2005-04-23 23:20 ` Jeff Garzik
@ 2005-04-23 23:46 ` Petr Baudis
  2005-04-24  0:35   ` Jeff Garzik
  2005-04-25 23:50   ` Tom Lord
  1 sibling, 2 replies; 14+ messages in thread
From: Petr Baudis @ 2005-04-23 23:46 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Ray Heasman, Git Mailing List, Linus Torvalds

Dear diary, on Sun, Apr 24, 2005 at 01:20:21AM CEST, I got a letter
where Jeff Garzik <jgarzik@pobox.com> told me that...
> Second, in your scenario, it's highly unlikely you would get 4 billion
> sha1 hash collisions, even if you had the disk space to store such a git
> database.

It's highly unlikely you would get a _single_ collision.

> First, the hash is NOT unique.
>
> Second, you lose data if you pretend it is unique. I don't like losing
> data.

*sigh*

We've been through this before, haven't we?

> Third, a data check only occurs in the highly unlikely case that a hash
> already exists -- a collision. Rather than "trillions of times", more
> like "one in a trillion chance."

No, a collision is pretty common thing, actually. It's the main power of
git, actually - when you do read-tree, modify it and do write-tree
(typically when doing commit), everything you didn't modify (99% of
stuff, most likely) is basically a collision - but it's ok since it just
stays the same.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
* Re: Hash collision count
  2005-04-23 23:46 ` Petr Baudis
@ 2005-04-24  0:35 ` Jeff Garzik
  2005-04-24  0:40   ` Petr Baudis
  1 sibling, 1 reply; 14+ messages in thread
From: Jeff Garzik @ 2005-04-24 0:35 UTC (permalink / raw)
To: Petr Baudis; +Cc: Ray Heasman, Git Mailing List, Linus Torvalds

Petr Baudis wrote:
> Dear diary, on Sun, Apr 24, 2005 at 01:20:21AM CEST, I got a letter
> where Jeff Garzik <jgarzik@pobox.com> told me that...
>> Second, in your scenario, it's highly unlikely you would get 4 billion
>> sha1 hash collisions, even if you had the disk space to store such a git
>> database.
>
> It's highly unlikely you would get a _single_ collision.

Agreed.

>> First, the hash is NOT unique.
>>
>> Second, you lose data if you pretend it is unique. I don't like losing
>> data.
>
> *sigh*
>
> We've been through this before, haven't we?

<shrug>  In messing around with archive servers, people get nervous
using (hash,value) based storage if there isn't even a simple test for
collisions.

Someone just told me that one implementation of the Venti archive
server[1] simply fails the write, if a data item exists with a duplicate
hash value. As long as git fails or does something -predictable- in the
face of the hash collision, I'm satisfied.

	Jeff

[1] http://www.cs.bell-labs.com/sys/doc/venti/venti.html
* Re: Hash collision count
  2005-04-24  0:35 ` Jeff Garzik
@ 2005-04-24  0:40 ` Petr Baudis
  2005-04-24  0:43   ` Jeff Garzik
  0 siblings, 1 reply; 14+ messages in thread
From: Petr Baudis @ 2005-04-24 0:40 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Ray Heasman, Git Mailing List, Linus Torvalds

Dear diary, on Sun, Apr 24, 2005 at 02:35:57AM CEST, I got a letter
where Jeff Garzik <jgarzik@pobox.com> told me that...
> Someone just told me that one implementation of the Venti archive
> server[1] simply fails the write, if a data item exists with a duplicate
> hash value. As long as git fails or does something -predictable- in the
> face of the hash collision, I'm satisfied.

-DCOLLISION_CHECK

See the top of Makefile.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
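The behavior behind `-DCOLLISION_CHECK` can be sketched roughly: before writing an object file, compare the incoming bytes against any existing object of the same name, and fail predictably on a mismatch. The function name and file layout below are illustrative, not git's actual C code:

```python
import os


def write_object(path, data):
    """Write an object file, failing predictably on a hash collision.

    If a file already exists at `path`, compare contents: identical data
    is a no-op (the object is already stored); different data under the
    same name is treated as a collision and rejected.
    """
    if os.path.exists(path):
        with open(path, "rb") as f:
            if f.read() == data:
                return False  # already stored, nothing to do
        raise RuntimeError(f"collision detected for {path}")
    with open(path, "wb") as f:
        f.write(data)
    return True  # newly written
```

This is the "fail or do something predictable" behavior Jeff asked for: silent reuse for identical content, a hard error for a true collision.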
* Re: Hash collision count
  2005-04-24  0:40 ` Petr Baudis
@ 2005-04-24  0:43 ` Jeff Garzik
  2005-04-24 21:24   ` Imre Simon
  0 siblings, 1 reply; 14+ messages in thread
From: Jeff Garzik @ 2005-04-24 0:43 UTC (permalink / raw)
To: Petr Baudis; +Cc: Ray Heasman, Git Mailing List, Linus Torvalds

Petr Baudis wrote:
> -DCOLLISION_CHECK

Cool. I am happy, then :)

Make sure that's enabled by default...

Thanks,

	Jeff
* Re: Hash collision count
  2005-04-24  0:43 ` Jeff Garzik
@ 2005-04-24 21:24 ` Imre Simon
  2005-04-24 22:25   ` Whales falling on houses - was: Hash collision count Jon Seymour
  0 siblings, 1 reply; 14+ messages in thread
From: Imre Simon @ 2005-04-24 21:24 UTC (permalink / raw)
To: Jeff Garzik
Cc: Petr Baudis, Ray Heasman, Git Mailing List, Linus Torvalds, Imre Simon

On 4/23/05, Jeff Garzik <jgarzik@pobox.com> wrote:
> Petr Baudis wrote:
> > -DCOLLISION_CHECK
>
> Cool. I am happy, then :)
>
> Make sure that's enabled by default...
>
> Thanks,
>
> 	Jeff

I would like to second the suggestion that this should be enabled by
default.

First, I do think it is highly unlikely that a collision will ever be
found. Actually, because of this, if one is found it would be an
important byproduct of git, and it would probably influence future
system designs. Hence, I believe it is worth losing some efficiency in
order to shed more light on this question of collisions.

I would like to add a recipe by which one would surely produce a
collision between two files of the same length, quite similar to each
other. I believe there is a popular belief that this is pretty much
impossible. Of course, we do not have time to execute this algorithm,
but I believe it convinces everyone that similar files *do* exist
(actually in abundance) with the same hash.

1. Take your favorite text file, at least 160 characters long.
2. Choose 160 positions in this file.
3. For each position choose your favorite misspelling of that character.
4. Produce all 2^160 text files, all of the same length, choosing for
   each position either the original or the alternate character.
5. Add an arbitrary file of the same length, different from the above.

By the pigeonhole principle, two of these 2^160 + 1 files have the same
sha1 hash. Or, for that matter, for any 160-bit hash the same is true.

Cheers,

Imre Simon
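Imre's pigeonhole argument can be demonstrated at toy scale: truncate SHA-1 to 8 bits, and then any 257 distinct inputs must contain a pair with the same (truncated) hash. A small sketch, with `truncated_hash` and `find_collision` as illustrative helper names:

```python
import hashlib


def truncated_hash(data, bits=8):
    """SHA-1 truncated to `bits` bits -- a stand-in for a full 160-bit
    hash, so the pigeonhole effect is visible at toy scale."""
    digest = int.from_bytes(hashlib.sha1(data).digest(), "big")
    return digest >> (160 - bits)


def find_collision(inputs, bits=8):
    """Return the first pair of distinct inputs sharing a truncated hash,
    or None. With more than 2**bits distinct inputs, a pair is guaranteed."""
    seen = {}  # truncated hash -> input that produced it
    for data in inputs:
        h = truncated_hash(data, bits)
        if h in seen and seen[h] != data:
            return seen[h], data
        seen[h] = data
    return None


# 257 distinct inputs into 256 buckets: a collision must exist.
inputs = [str(i).encode() for i in range(257)]
pair = find_collision(inputs)
```

Replace 8 bits by 160 and 257 inputs by 2^160 + 1 files, and you have Imre's construction: guaranteed to succeed, and guaranteed never to finish.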
* Whales falling on houses - was: Hash collision count
  2005-04-24 21:24 ` Imre Simon
@ 2005-04-24 22:25 ` Jon Seymour
  0 siblings, 0 replies; 14+ messages in thread
From: Jon Seymour @ 2005-04-24 22:25 UTC (permalink / raw)
To: Imre Simon
Cc: Jeff Garzik, Petr Baudis, Ray Heasman, Git Mailing List, Linus Torvalds, Imre Simon

> 1. Take your favorite text file, at least 160 characters long.
> 2. Choose 160 positions in this file.
> 3. For each position choose your favorite misspelling of that character.
> 4. Produce all 2^160 text files, all of the same length, choosing for
>    each position either the original or the alternate character.
> 5. Add an arbitrary file of the same length, different from the above.
>
> Two of these files have the same sha1 hash. Or, for that matter, for
> any 160 bit hash the same is true.

If you were to create those files at 10^9 files per second, it would
take you 10^38 years before you were in a position to take step 5. I am
about to turn 38 this week. Would that I could live to 10^38.

It's absolute rubbish to say that the best solution from an
"engineering" point of view is to eliminate the infinitesimal
possibility of a collision. Engineering is all about assessing risk and
making suitable trade-offs. Every day of the week, "real" engineers
accept life-threatening risks that put thousands of people's lives in
danger. They do it because we live in a world where risk cannot be
eliminated, merely reduced to an acceptable level.

I can't understand that you are prepared to drive a car or fly in a
Boeing or Airbus that has a demonstrated risk of killing you, yet you
want to insist on eliminating a risk that at most might create an
interesting Slashdot headline: "Jolt-crazed programmer finds SHA1
collision - but later dies when whale falls on house".

jon.
--
homepage: http://www.zeta.org.au/~jon/
blog: http://orwelliantremors.blogspot.com/
* Re: Hash collision count
  2005-04-23 23:46 ` Petr Baudis
  2005-04-24  0:35   ` Jeff Garzik
@ 2005-04-25 23:50 ` Tom Lord
  2005-04-26  0:00   ` Petr Baudis
  1 sibling, 1 reply; 14+ messages in thread
From: Tom Lord @ 2005-04-25 23:50 UTC (permalink / raw)
To: pasky; +Cc: git

From: Petr Baudis <pasky@ucw.cz>

Pasky:

> No, a collision is pretty common thing, actually. It's the main power of
> git, actually - when you do read-tree, modify it and do write-tree
> (typically when doing commit), everything you didn't modify (99% of
> stuff, most likely) is basically a collision - but it's ok since it
> just stays the same.

That is not the way people ordinarily use the word "collision".
It's pretty much the opposite of the normal way, actually.

-t
* Re: Hash collision count
  2005-04-25 23:50 ` Tom Lord
@ 2005-04-26  0:00 ` Petr Baudis
  0 siblings, 0 replies; 14+ messages in thread
From: Petr Baudis @ 2005-04-26 0:00 UTC (permalink / raw)
To: Tom Lord; +Cc: git

Dear diary, on Tue, Apr 26, 2005 at 01:50:31AM CEST, I got a letter
where Tom Lord <lord@emf.net> told me that...
> From: Petr Baudis <pasky@ucw.cz>
>
> Pasky:
>
> > No, a collision is pretty common thing, actually. It's the main power of
> > git, actually - when you do read-tree, modify it and do write-tree
> > (typically when doing commit), everything you didn't modify (99% of
> > stuff, most likely) is basically a collision - but it's ok since it
> > just stays the same.
>
> That is not the way people ordinarily use the word "collision".
> It's pretty much the opposite of the normal way, actually.

You need to quote me in the context of Jeff Garzik's

> > Third, a data check only occurs in the highly unlikely case that a hash
> > already exists -- a collision. Rather than "trillions of times", more
> > like "one in a trillion chance."

I just wanted to point out that the data check would have to occur
every time you didn't modify an object.

Kind regards,

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
* Re: Hash collision count
  2005-04-23 23:20 ` Jeff Garzik
  2005-04-23 23:46   ` Petr Baudis
@ 2005-04-24  1:01 ` Ray Heasman
  1 sibling, 0 replies; 14+ messages in thread
From: Ray Heasman @ 2005-04-24 1:01 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Git Mailing List

On Sat, 2005-04-23 at 19:20 -0400, Jeff Garzik wrote:
> Ray Heasman wrote:
> > On Sat, 2005-04-23 at 16:27 -0400, Jeff Garzik wrote:
> >
> >> Ideally a hash + collision-count pair would make the best key, rather
> >> than just hash alone.
> >>
> >> A collision -will- occur eventually, and it is trivial to avoid this
> >> problem:
> >>
> >> 	$n = 0
> >> 	attempt to store as $hash-$n
> >> 	if $hash-$n exists (unlikely)
> >> 		$n++
> >> 		goto restart
> >> 	key = $hash-$n
> >
> > Great. So what have you done here? Suppose you have 32 bits of counter
> > for n. Whoopee, you just added 32 bits to your hash, using a two stage
> > algorithm. So, you have a 192 bit hash assuming you started with the 160
> > bit SHA. And, one day your 32 bit counter won't be enough. Then what?
>
> First, there is no 32-bit limit. git stores keys (aka hashes) as
> strings. As it should.

Oh great, now we have variable-length id strings too. And we'll have to
pretend the OS can store infinite-length file names.

> Second, in your scenario, it's highly unlikely you would get 4 billion
> sha1 hash collisions, even if you had the disk space to store such a git
> database.

Er. So your so-unlikely-the-sun-will-burn-out-first scenario beats my
so-unlikely-the-sun-will-burn-out-first scenario? Why am I not worried?

> > You aren't solving anything. You're just putting it off, and doing it in
> > a way that breaks all the wonderful semantics possible by just assuming
> > that the hash is unique. All of a sudden we are doing checks of data
> > that we never did before, and we have to do the check trillions of times
> > before the CPU time spent pays off.
>
> First, the hash is NOT unique.

Nooooo. Really?

Why not just use an 8192-bit hash for each 1 KiB of data? We could store
a zero-length file and store all the data in the filename. Guaranteed,
no hash collisions that way.

We make an assumption that we know is right most of the time, and we use
it because we know our computer will crash from random quantum
fluctuations before we have a chance of bumping into the problem. You do
know that metastability means that every logic gate in your computer
hardware is guaranteed to fail every "x" operations, where x is defined
by process size, voltage and temperature? Sure, any failures in git
would be data-dependent rather than random, but that just means we don't
get to store carefully crafted blocks invented by hypothetical
cryptographers that have completely broken SHA.

> Second, you lose data if you pretend it is unique. I don't like losing
> data.

You lose data either way. It's just that we get to burn out a few extra
suns before yours dies, and I can burn out whole galaxies before mine
dies by using a 256-bit hash, anyway.

> Third, a data check only occurs in the highly unlikely case that a hash
> already exists -- a collision. Rather than "trillions of times", more
> like "one in a trillion chance."

Heh. I calculate it has a 50% probability of happening after you have
seen 10^24 input blocks. So, you are off by a factor of a trillion or
so. Assuming we store 1 KiB blocks with a 160-bit hash, we would be able
to store 1000 trillion terabytes before the chance of hitting a
collision goes to 50%. To use marketing units, that is around 10
trillion Libraries of Congress. Every 2 bits we add to the hash doubles
the amount of data we can store before we hit a 50% probability of
collision.

I'm not sure how I could convince you that we're arguing about the
number of angels that could dance on a pin.

Ciao,
Ray
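Ray's 10^24 figure is the standard birthday bound: for a b-bit hash, the collision probability reaches 50% after roughly sqrt(2 ln 2) * 2^(b/2) random inputs. A quick check of his arithmetic (the helper name is illustrative):

```python
import math


def birthday_bound(bits):
    """Number of random inputs at which the collision probability of a
    `bits`-bit hash reaches ~50%: sqrt(2 ln 2) * 2**(bits/2)."""
    return math.sqrt(2 * math.log(2)) * 2 ** (bits / 2)


n50 = birthday_bound(160)   # ~1.4e24 inputs for a 160-bit hash like SHA-1
stored = n50 * 1024         # bytes stored, at 1 KiB per block
```

That works out to about 1.4 x 10^24 blocks, or on the order of 10^27 bytes — matching Ray's "1000 trillion terabytes" (10^15 TB). Each extra 2 bits of hash doubles 2^(b/2), hence his closing observation.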
* Re: Hash collision count
  2005-04-23 20:27 Hash collision count Jeff Garzik
  2005-04-23 20:33 ` Jeff Garzik
  2005-04-23 23:00 ` Ray Heasman
@ 2005-04-24  7:56 ` David Lang
  2 siblings, 0 replies; 14+ messages in thread
From: David Lang @ 2005-04-24 7:56 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Git Mailing List, Linus Torvalds

On Sat, 23 Apr 2005, Jeff Garzik wrote:

> Ideally a hash + collision-count pair would make the best key, rather than
> just hash alone.
>
> A collision -will- occur eventually, and it is trivial to avoid this problem:
>
> 	$n = 0
> 	attempt to store as $hash-$n
> 	if $hash-$n exists (unlikely)
> 		$n++
> 		goto restart
> 	key = $hash-$n
>
> Tangent-as-the-reason-I-bring-this-up:
>
> One of my long-term projects is an archive service, somewhat like Plan9's
> venti: a multi-server key-value database, with sha1 hash as the key.
>
> However, as the database grows into the terabyte (possibly petabyte) range,
> the likelihood of a collision transitions rapidly from unlikely -> possible
> -> likely.
>
> Since it is -so- simple to guarantee that you avoid collisions, I'm hoping
> git will do so before the key structure is too ingrained.

Jeff, this can't work because you don't know what objects exist on
other servers. In fact, given the number of different repositories that
will eventually exist, the odds are good that when the collision occurs,
it will be when object repositories get combined.

David Lang

--
There are two ways of constructing a software design. One way is to
make it so simple that there are obviously no deficiencies. And the
other way is to make it so complicated that there are no obvious
deficiencies. -- C.A.R. Hoare
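David's objection can be made concrete: collision counters are assigned locally, so two independent stores can hand out the same key for different content, and the conflict only surfaces when the stores are combined. A sketch under Jeff's `$hash-$n` key format (the `merge_stores` function and the `deadbeef` hash value are illustrative):

```python
def merge_stores(a, b):
    """Merge two {key: content} stores keyed by "<hash>-<n>".

    Raises if the same key maps to different content in each store --
    the failure mode when independently assigned counters meet.
    """
    merged = dict(a)
    for key, data in b.items():
        if key in merged and merged[key] != data:
            raise ValueError(f"key {key} assigned to different content")
        merged[key] = data
    return merged


# Two repositories independently assign counter 0 under the same hash
# value, but for different (colliding) content.
repo_a = {"deadbeef-0": b"first content"}
repo_b = {"deadbeef-0": b"second content"}
```

At merge time the keys clash, so every key would have to be re-checked and potentially renumbered — which is exactly why local counters don't compose across servers.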
end of thread, other threads:[~2005-04-25 23:55 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-04-23 20:27 Hash collision count Jeff Garzik
2005-04-23 20:33 ` Jeff Garzik
2005-04-23 23:00 ` Ray Heasman
2005-04-23 23:20   ` Jeff Garzik
2005-04-23 23:46     ` Petr Baudis
2005-04-24  0:35       ` Jeff Garzik
2005-04-24  0:40         ` Petr Baudis
2005-04-24  0:43           ` Jeff Garzik
2005-04-24 21:24             ` Imre Simon
2005-04-24 22:25               ` Whales falling on houses - was: Hash collision count Jon Seymour
2005-04-25 23:50     ` Tom Lord
2005-04-26  0:00       ` Petr Baudis
2005-04-24  1:01   ` Ray Heasman
2005-04-24  7:56 ` David Lang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).