From: Linus Torvalds <torvalds@linux-foundation.org>
To: Junio C Hamano <gitster@pobox.com>
Cc: Jerome Baum <jerome@jeromebaum.com>, git@vger.kernel.org
Subject: Re: Hash algorithm choice
Date: Sun, 9 Aug 2009 11:33:18 -0700 (PDT)
Message-ID: <alpine.LFD.2.01.0908091116340.3288@localhost.localdomain>
In-Reply-To: <7vljls986b.fsf@alter.siamese.dyndns.org>



On Sun, 9 Aug 2009, Junio C Hamano wrote:
> 
> Just in case Jerome really wants to go further, "almost no changes" and
> "minimal" refers to the fact that we have a few hard-coded hash values
> known to the code, such as the object name for an empty blob and an empty
> tree.

One nice thing about git is also that in many ways git doesn't even 
_care_ about the actual hash algorithm, because read-only git will 
generally (always?) just use the hashes as pointers.

So you could actually create a very limited sort of git that doesn't have 
any hash algorithm at all - and it would still be able to do a lot of 
regular git operations.

Git strictly speaking needs to hash things only when

 - validating the index (ie if the index and stat data do not match). Git 
   also checks the SHA1 hash at the end of the index file every time it 
   loads it.

 - creating new objects (ie commit)

 - git-fsck

 - probably some situation I didn't think about.

but during normal operations git doesn't strictly _need_ to hash anything.
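
As a rough illustration of the two most common hashing points in the list 
above, here is a minimal Python sketch (not git's actual C code). It assumes 
the usual SHA1 object naming, where the hash covers a "<type> <size>\0" 
header followed by the content, and an index file whose last 20 bytes are 
the SHA1 of everything before them. The names object_name and 
index_checksum_ok are made up for this example.

```python
import hashlib


def object_name(obj_type, content):
    """Name an object the way "creating new objects" has to: by hashing it."""
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


def index_checksum_ok(index_bytes):
    """Check the trailing SHA1 of an index file image (hypothetical helper)."""
    if len(index_bytes) < 20:
        return False
    body, trailer = index_bytes[:-20], index_bytes[-20:]
    return hashlib.sha1(body).digest() == trailer
```

With this, object_name("blob", b"") gives 
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 and object_name("tree", b"") gives 
4b825dc642cb6eb9a060e54bf8d69288fbee4904, which are the hard-coded empty 
blob and empty tree names mentioned in the quoted text.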

For example, I literally just checked what happens when you break our hash 
algorithm on purpose, and while any index operation is unhappy and 
complains about index corruption:

	[torvalds@nehalem git]$ ./git diff
	error: bad index file sha1 signature
	fatal: index file corrupt

that's really largely a sanity check. You can literally do things like 
"git log -p" without ever generating a single hash at all - because all 
git will do is to look up objects based on the hashes it finds.
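
To make the "hashes as pointers" idea concrete, here is a small Python 
sketch of a read path that never computes a hash. It assumes the loose 
object layout (objects/<first two hex digits>/<remaining digits>, zlib 
compressed, with a "<type> <size>\0" header) and ignores packfiles 
entirely. The function name read_loose_object is hypothetical.

```python
import os
import zlib


def read_loose_object(objects_dir, name):
    """Look an object up by name; the name is used purely as a pointer."""
    path = os.path.join(objects_dir, name[:2], name[2:])
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    # Split "<type> <size>\0<content>" without ever hashing the content.
    header, _, content = raw.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("object size does not match header")
    return obj_type, content
```

Note that nothing here imports a hash library: the reader only follows the 
name to a file, so it would keep working even if the hash function used to 
create the names were unavailable.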

Now, what's nice about this is that it means that 

 (a) Hash performance only really matters for "git add" and "git fsck" 
     (well, as long as it's not _totally_ sucky. As mentioned above, we do 
     check the integrity of the index file more often, but that could be a 
     different hash than the _object_ hashes - so even if you change the 
     object hashes to be something other than SHA1, you wouldn't 
     necessarily have to change the index checksum)

 (b) You could actually afford to have git auto-detect the hashes from 
     existing objects

 (c) it's not even entirely unreasonable to mix different hashes in the 
     same repository (ie "old objects use old hash, new objects use new 
     hash").

     Aliasing (same object with different hashes) will hurt disk-space, 
     and will make some operations (like merges and 'git diff') much more 
     expensive when they hit a "hash boundary", but other than that you'd 
     never even notice.
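
Points (b) and (c) can be sketched in Python as follows. This is only an 
illustration under assumptions that are not in the mail: old objects are 
named with SHA1, new ones with SHA-256, and the two are told apart by the 
length of the hex name (40 vs 64). The names detect_algo, object_name and 
same_blob, and the load callback, are hypothetical.

```python
import hashlib

# Assumed mapping: hex length of a name -> hash that produced it.
ALGO_BY_HEX_LEN = {40: "sha1", 64: "sha256"}


def detect_algo(name):
    """Auto-detect the hash from an existing object name."""
    return ALGO_BY_HEX_LEN[len(name)]


def object_name(content, algo):
    """Name a blob under the given hash algorithm."""
    header = f"blob {len(content)}\0".encode("ascii")
    return hashlib.new(algo, header + content).hexdigest()


def same_blob(name_a, name_b, load):
    """Compare two blobs by name, given a load(name) -> bytes callback."""
    if detect_algo(name_a) == detect_algo(name_b):
        return name_a == name_b          # cheap: compare the pointers
    return load(name_a) == load(name_b)  # hash boundary: read the contents
```

The last line is where the aliasing cost shows up: the same content has two 
different names across the boundary, so a diff or merge can no longer 
conclude "different name, different content" (or skip identical subtrees) 
without actually reading the objects.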

Of course, the downside to the above is that git may not notice some kinds 
of corruption until you actually do things like "git fsck" (or native git 
transfers like "git pull", which always check the result very carefully). 

For disk corruption issues, there are things like zlib Adler checksums 
(and xdelta crc's) etc that we always check when unpacking objects, but 
the hash itself only gets recomputed for fairly special events.
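
The zlib part of that is easy to see in isolation. The Python sketch below 
is only an illustration using the standard zlib module: a zlib stream ends 
with a 4-byte Adler-32 of the uncompressed data, and the decompressor 
rejects the stream when it does not match. The helper names unpack and 
flip_last_byte are made up for this example.

```python
import zlib


def unpack(compressed):
    """Inflate a zlib stream; zlib itself verifies the Adler-32 trailer."""
    return zlib.decompress(compressed)


def flip_last_byte(data):
    """Simulate disk corruption inside the 4-byte Adler-32 trailer."""
    return data[:-1] + bytes([data[-1] ^ 0xFF])


good = zlib.compress(b"blob 6\0hello\n")
try:
    unpack(flip_last_byte(good))
    detected = False
except zlib.error:
    # Caught at unpack time, with no object hash recomputed.
    detected = True
```

This catches damage to the stored bytes, but it says nothing about whether 
the content still matches the name it was looked up under, which is the 
part that only gets checked when the object hash is recomputed.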

			Linus
