From: linux@horizon.com
Subject: Re: VCS comparison table
Date: 20 Oct 2006 04:43:18 -0400
Message-ID: <20061020084318.9705.qmail@science.horizon.com>
To: loki@research.canon.com.au
Cc: git@vger.kernel.org, linux@horizon.com

> I've been following the git-vs-bzr discussion, and I'd like to ask a
> question (being new to both bzr and git). How does git disambiguate SHA1
> hash collisions? I think git has an alternative way to name revisions
> (can someone please explain it in more detail, I've seen ~
> mentioned only in passing in this thread). It seems to me collisions are
> a good argument in favour of having two independent naming schemes, so
> that you're not solely relying on hashes being unique.
>
> A strong argument is that a global namespace based on hashes of data is
> ideal because the names are generated from the data being named, and
> therefore are immutable. Same data => same name for that data, always
> and forever, which is desirable when merging named data from many
> sources.
> But the converse isn't true: one name does not necessarily map
> to only that data. Have I misunderstood? Is this a problem?

Git avoids the problem by using a very large cryptographic hash.
While recent cryptographic advances mean that it will soon be barely
possible to deliberately construct two files with the same SHA hash,
absent massive computational investment, it can be considered a truly
random hash function.

That means that, given N objects in the repository, there are
N * (N-1) / 2 pairs, each of which is a chance to collide. With 2^160
possible hash values, the probability of having a collision is thus
roughly N * (N-1) / 2^161, which doesn't reach 1/2 until N == 2^80.

Suppose a git repository has 100 objects added per second, continuously,
until the Sun turns into a red giant and swallows the Earth in 5-6
billion years' time (thus halting development).

That will generate 2^64 objects in the repository, and the probability
of there being a collision at that time is 2^128/2^161 = 2^-33, or
1 in 8 billion, small enough to be negligible.

For slightly more plausible repository sizes, the odds of having a
problem are even more minuscule. And the benefits gained from having
a completely deterministic data => name mapping are considerable,
well worth running such a minuscule risk of major problems.

(There is a road map for conversion to the even larger SHA-256 if that
ever proves necessary, but it's not likely to.)

As for the ~ syntax, see "man git-rev-parse". The parent of <rev> is
<rev>^. If it has multiple parents, the others are <rev>^2, <rev>^3,
etc. <rev>^^^^ is the great-great-grandparent of the named rev, and
may also be written <rev>~4.
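
For anyone who wants to check the collision arithmetic earlier in this
message, here is a minimal Python sketch. It assumes SHA-1 behaves as
a uniformly random 160-bit function and uses the same pair-counting
estimate, N * (N-1) / 2^161. The function names and the 5.5 billion
year midpoint are my own choices for illustration, not anything taken
from git.

```python
import math
from fractions import Fraction

HASH_BITS = 160  # SHA-1 produces 160-bit names


def collision_probability(n_objects, hash_bits=HASH_BITS):
    """Pair-counting estimate: (number of pairs) / (number of hash values).

    This is the rough figure used in the message, not an exact
    birthday probability.
    """
    pairs = n_objects * (n_objects - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


def objects_created(rate_per_second=100, years=5.5e9):
    """Objects added at a constant rate over the given number of years.

    5.5e9 years is an assumed midpoint of the "5-6 billion" range.
    """
    seconds_per_year = 365.25 * 24 * 3600
    return int(rate_per_second * seconds_per_year * years)


if __name__ == "__main__":
    n = objects_created()
    print("log2(objects created)      =", math.log2(n))
    p = collision_probability(2 ** 64)
    print("log2(P) for 2^64 objects   =", math.log2(float(p)))
    half = collision_probability(2 ** 80)
    print("estimate for 2^80 objects  =", float(half))
```

The exact fractions avoid any floating-point rounding until the final
conversion, which is why the sketch uses Fraction rather than plain
division.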