Hash algorithm choice

* Hash algorithm choice
       [not found] <f448a46a0908090907v68542e4dw1f1c4f610cb46ca2@mail.gmail.com>
@ 2009-08-09 16:17 ` Jerome Baum
  2009-08-09 17:46   ` Linus Torvalds
                     ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Jerome Baum @ 2009-08-09 16:17 UTC
  To: git

Hi all,

I just had an idea regarding git and hashing. Couldn't find any
previous discussion on the subject so here's what I'm thinking of:

How difficult would it be to allow users to choose a hash function
during git-init which is then globally used in the repo? Are there
many changes needed or are changes in git-hash-object and git-init
sufficient?

I'm not trying to undermine the decision to use SHA-1 or anything, but
I would guess it builds for the future and adds flexibility to the
system. So when SHA-1 is no longer sufficient, it would be easy to
switch to RIPEMD-160 with a simple "git-init --hash=ripemd160".

Would be happy for any comments on this.

Regards,

Jerome Baum
Hugo-Junkers-Str. 2
D-37083 Göttingen
Germany
Tel.: +49 551 2008782
Web: www.JeromeBaum.com

* Re: Hash algorithm choice
  2009-08-09 16:17 ` Hash algorithm choice Jerome Baum
@ 2009-08-09 17:46   ` Linus Torvalds
  2009-08-09 18:03     ` Junio C Hamano
  2009-08-09 17:49   ` Johannes Schindelin
  2009-08-09 18:44   ` Matthieu Moy
  2 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2009-08-09 17:46 UTC
  To: Jerome Baum; +Cc: git

On Sun, 9 Aug 2009, Jerome Baum wrote:
> 
> How difficult would it be to allow users to choose a hash function
> during git-init which is then globally used in the repo? Are there
> many changes needed or are changes in git-hash-object and git-init
> sufficient?

If you limit the hash size to 20 bytes, there are almost no changes 
necessary.

You'd need to hijack the 'SHA1_Init/SHA1_Update/SHA1_Final' functions, of 
course, and you'd likely want to rename them (and eventually a lot of 
other functions too), but that renaming is mechanical and isn't even 
needed for proper working.

Now, if you would ever want to extend the _size_ of the hash, that's a 
much much bigger problem, but if you're ok with just changing the hash and 
then truncating the result to 20 bytes (ie kind of like sha-512-160), or 
you're ok with limiting yourself to 20-byte hashes like RIPEMD-160, the 
size of the changes should be minimal.

			Linus
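
For illustration, here is a minimal sketch of the "change the hash, keep
20 bytes" idea described above. It is not git's code: the function name
`object_name` and the `algo` parameter are assumptions made for this example.
It hashes a git-style object (a "<type> <size>\0" header followed by the
body) and keeps only the first 20 bytes of the digest. Plain truncation of
SHA-512 is used here; that is only "kind of like" a 160-bit SHA-512 variant,
not a standardized construction.

```python
import hashlib

HASH_SIZE = 20  # bytes; the object-name width discussed in the thread


def object_name(obj_type: bytes, body: bytes, algo: str = "sha1") -> bytes:
    """Hash "<type> <size>\\0" + body with `algo`, truncated to 20 bytes.

    `algo` is a hypothetical knob for this sketch, not a real git option.
    """
    header = obj_type + b" " + str(len(body)).encode("ascii") + b"\0"
    digest = hashlib.new(algo, header + body).digest()
    return digest[:HASH_SIZE]


sha1_name = object_name(b"blob", b"hello\n")
sha512_name = object_name(b"blob", b"hello\n", algo="sha512")

# Both names have the same width, so code that only stores or compares
# 20-byte names would not need to know which algorithm produced them.
print(len(sha1_name), sha1_name.hex())
print(len(sha512_name), sha512_name.hex())
```

Because the name width is unchanged, only the code that computes names has to
know about the algorithm; everything that stores or compares names is
unaffected.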

* Re: Hash algorithm choice
  2009-08-09 17:46   ` Linus Torvalds
@ 2009-08-09 18:03     ` Junio C Hamano
  2009-08-09 18:16       ` Sverre Rabbelier
  2009-08-09 18:33       ` Linus Torvalds
  0 siblings, 2 replies; 8+ messages in thread
From: Junio C Hamano @ 2009-08-09 18:03 UTC
  To: Linus Torvalds; +Cc: Jerome Baum, git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> If you limit the hash size to 20 bytes, there are almost no changes 
> necessary.
>
> You'd need to hijack the 'SHA1_Init/SHA1_Update/SHA1_Final' functions, of 
> course, and you'd likely want to rename them (and eventually a lot of 
> other functions too), but that renaming is mechanical and isn't even 
> needed for proper working.
>
> Now, if you would ever want to extend the _size_ of the hash, that's a 
> much much bigger problem, but if you're ok with just changing the hash and 
> then truncating the result to 20 bytes (ie kind of like sha-512-160), or 
> you're ok with limiting yourself to 20-byte hashes like RIPEMD-160, the 
> size of the changes should be minimal.

Just in case Jerome really wants to go further, "almost no changes" and
"minimal" refers to the fact that we have a few hard-coded hash values
known to the code, such as the object name for an empty blob and an empty
tree.
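
To make the hard-coded values concrete, the sketch below recomputes the two
well-known SHA-1 object names for the empty blob and the empty tree, and shows
that they come out different under another algorithm. The helper
`hex_object_name` and the truncation to 40 hex digits are assumptions made for
this example, not git's implementation.

```python
import hashlib

# Well-known SHA-1 object names for the empty blob and the empty tree.
EMPTY_BLOB_SHA1 = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
EMPTY_TREE_SHA1 = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def hex_object_name(obj_type: str, body: bytes, algo: str = "sha1") -> str:
    """Return the first 40 hex digits (20 bytes) of the object hash."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.new(algo, header + body).hexdigest()[:40]


# With SHA-1 the computed names match the constants above.
print(hex_object_name("blob", b"") == EMPTY_BLOB_SHA1)
print(hex_object_name("tree", b"") == EMPTY_TREE_SHA1)

# With any other algorithm the same empty objects get different names,
# so constants like these would have to be provided per algorithm.
print(hex_object_name("blob", b"", algo="sha256"))
print(hex_object_name("tree", b"", algo="sha256"))
```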

* Re: Hash algorithm choice
  2009-08-09 18:03     ` Junio C Hamano
@ 2009-08-09 18:16       ` Sverre Rabbelier
  2009-08-09 18:35         ` Linus Torvalds
  2009-08-09 18:33       ` Linus Torvalds
  1 sibling, 1 reply; 8+ messages in thread
From: Sverre Rabbelier @ 2009-08-09 18:16 UTC
  To: Junio C Hamano; +Cc: Linus Torvalds, Jerome Baum, git

Heya,

On Sun, Aug 9, 2009 at 11:03, Junio C Hamano<gitster@pobox.com> wrote:
> Just in case Jerome really wants to go further, "almost no changes" and
> "minimal" refers to the fact that we have a few hard-coded hash values
> known to the code, such as the object name for an empty blob and an empty
> tree.

Wouldn't the transport code also have to be modified? I assume git's
integrity checking would yell at you if you gave it commits with
non-sha1-hashes and have no way to tell it that that hash was
calculated with a non-sha1-hash at clone time?

-- 
Cheers,

Sverre Rabbelier

* Re: Hash algorithm choice
  2009-08-09 18:16       ` Sverre Rabbelier
@ 2009-08-09 18:35         ` Linus Torvalds
  0 siblings, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2009-08-09 18:35 UTC
  To: Sverre Rabbelier; +Cc: Junio C Hamano, Jerome Baum, git

On Sun, 9 Aug 2009, Sverre Rabbelier wrote:
>
> Wouldn't the transport code also have to be modified? I assume git's
> integrity checking would yell at you if you gave it commits with
> non-sha1-hashes and have no way to tell it that that hash was
> calculated with a non-sha1-hash at clone time?

Well, if you start introducing new hashes, the assumption is that every 
git that accesses the repository would have to be updated.

You certainly could never pull/push between git versions that don't know 
about each other's hashes. But you _can_ autodetect the hash mechanism 
(simple: just try them all on the first object you encounter).

		Linus
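
The "try them all" detection mentioned above can be sketched as follows. This
is an illustration only: the candidate list `CANDIDATES`, the helper names,
and the 20-byte truncation are assumptions of this example, not an existing
git mechanism.

```python
import hashlib
from typing import Optional

# Hypothetical set of algorithms a repository might have been created with.
CANDIDATES = ("sha1", "sha256", "sha512")


def name_for(algo: str, raw_object: bytes) -> bytes:
    """20-byte name of a raw object ("<type> <size>\\0" + body) under `algo`."""
    return hashlib.new(algo, raw_object).digest()[:20]


def detect_algorithm(name: bytes, raw_object: bytes) -> Optional[str]:
    """Return the first candidate whose hash of the object equals its name."""
    for algo in CANDIDATES:
        if name_for(algo, raw_object) == name:
            return algo
    return None  # unknown algorithm, or a corrupt object


raw = b"blob 6\0hello\n"
print(detect_algorithm(name_for("sha256", raw), raw))
print(detect_algorithm(b"\0" * 20, raw))
```

One object is enough for detection because a name produced by one algorithm
matching the output of another by accident is as unlikely as a hash
collision.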

* Re: Hash algorithm choice
  2009-08-09 18:03     ` Junio C Hamano
  2009-08-09 18:16       ` Sverre Rabbelier
@ 2009-08-09 18:33       ` Linus Torvalds
  1 sibling, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2009-08-09 18:33 UTC
  To: Junio C Hamano; +Cc: Jerome Baum, git

On Sun, 9 Aug 2009, Junio C Hamano wrote:
> 
> Just in case Jerome really wants to go further, "almost no changes" and
> "minimal" refers to the fact that we have a few hard-coded hash values
> known to the code, such as the object name for an empty blob and an empty
> tree.

One nice thing about git is also that in many ways git doesn't even 
_care_ about the actual hash algorithm, because read-only git will 
generally (always?) just use the hashes as pointers.

So you could actually create a very limited sort of git that doesn't have 
any hash algorithm at all - and it would still be able to do a lot of 
regular git operations.

Git strictly speaking needs to hash things only when

 - validating the index (ie if the index and stat data do not match). Git 
   also checks the SHA1 hash at the end of the index file every time it 
   loads it.

 - creating new objects (ie commit)

 - git-fsck

 - probably some situation I didn't think about.

but during normal operations git doesn't strictly _need_ to hash anything.

For example, I literally just checked what happens when you break our hash 
algorithm on purpose, and while any index operation is unhappy and 
complains about index corruption:

	[torvalds@nehalem git]$ ./git diff
	error: bad index file sha1 signature
	fatal: index file corrupt

that's really largely a sanity check. You can literally do things like 
"git log -p" without ever generating a single hash at all - because all 
git will do is to look up objects based on the hashes it finds.
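
As an illustration of hashes being used purely as pointers, the sketch below
stores one loose object and then reads it back by name. It assumes the usual
loose-object layout (a zlib-compressed "<type> <size>\0" header plus body,
stored under a directory named after the first two hex digits); the function
names are invented for this example. Note that only the write path computes
a hash.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose(objects_dir: str, obj_type: str, body: bytes) -> str:
    """Store an object; this is the only place a hash is computed."""
    raw = f"{obj_type} {len(body)}".encode("ascii") + b"\0" + body
    name = hashlib.sha1(raw).hexdigest()
    path = os.path.join(objects_dir, name[:2], name[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(raw))
    return name


def read_loose(objects_dir: str, name: str):
    """Look an object up by name; the name is only used to build a path."""
    path = os.path.join(objects_dir, name[:2], name[2:])
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    header, _, body = raw.partition(b"\0")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("object size does not match its header")
    return obj_type.decode("ascii"), body


with tempfile.TemporaryDirectory() as objects_dir:
    name = write_loose(objects_dir, "blob", b"hello\n")
    print(read_loose(objects_dir, name))
```

The reader would work unchanged if `write_loose` used a different algorithm
that produced names of the same length, which is the point being made in this
message.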

Now, what's nice about this is that it means that 

 (a) Hash performance only really matters for "git add" and "git fsck" 
     (well, as long as it's not _totally_ sucky. As mentioned above, we do 
     check the integrity of the index file more often, but that could be a 
     different hash than the _object_ hashes - so even if you change the 
     object hashes to be something other than SHA1, you wouldn't 
     necessarily have to change the index checksum)

 (b) You could actually afford to have git auto-detect the hashes from 
     existing objects

 (c) it's not even entirely unreasonable to mix different hashes in the 
     same repository (ie "old objects use old hash, new objects use new 
     hash").

     Aliasing (same object with different hashes) will hurt disk-space, 
     and will make some operations (like merges and 'git diff') much more 
     expensive when they hit a "hash boundary", but other than that you'd 
     never even notice.

Of course, the downside to the above is that git may not notice some kinds 
of corruption until you actually do things like "git fsck" (or native git 
transfers like "git pull", which always check the result very carefully). 

For disk corruption issues, there are things like zlib Adler checksums 
(and xdelta crc's) etc that we always check when unpacking objects, but 
the hash itself only gets recomputed for fairly special events.

			Linus
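
The zlib check mentioned above is independent of the object hash. A small
sketch, using Python's `zlib` module rather than git's own unpacking code
(the helper `try_unpack` is an assumption of this example): a zlib stream
ends with a 4-byte Adler-32 checksum of the uncompressed data, so damaging
that trailer makes decompression fail without any SHA-1 being computed.

```python
import zlib


def try_unpack(data: bytes):
    """Return the decompressed bytes, or None if zlib reports corruption."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        return None


raw = b"blob 6\0hello\n"
packed = zlib.compress(raw)

# The last four bytes of a zlib stream hold the big-endian Adler-32 of the
# uncompressed data.
print(int.from_bytes(packed[-4:], "big") == zlib.adler32(raw))

# Flip bits in the trailer to simulate on-disk corruption.
damaged = bytearray(packed)
damaged[-1] ^= 0xFF

print(try_unpack(packed) == raw)
print(try_unpack(bytes(damaged)) is None)
```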

* Re: Hash algorithm choice
  2009-08-09 16:17 ` Hash algorithm choice Jerome Baum
  2009-08-09 17:46   ` Linus Torvalds
@ 2009-08-09 17:49   ` Johannes Schindelin
  2009-08-09 18:44   ` Matthieu Moy
  2 siblings, 0 replies; 8+ messages in thread
From: Johannes Schindelin @ 2009-08-09 17:49 UTC
  To: Jerome Baum; +Cc: git

Hi,

On Sun, 9 Aug 2009, Jerome Baum wrote:

> I just had an idea regarding git and hashing. Couldn't find any previous 
> discussion on the subject so here's what I'm thinking of:
> 
> How difficult would it be to allow users to choose a hash function 
> during git-init which is then globally used in the repo? Are there many 
> changes needed or are changes in git-hash-object and git-init 
> sufficient?

A quick search revealed this:

http://thread.gmane.org/gmane.comp.version-control.git/25632/focus=25735

Hth,
Dscho

* Re: Hash algorithm choice
  2009-08-09 16:17 ` Hash algorithm choice Jerome Baum
  2009-08-09 17:46   ` Linus Torvalds
  2009-08-09 17:49   ` Johannes Schindelin
@ 2009-08-09 18:44   ` Matthieu Moy
  2 siblings, 0 replies; 8+ messages in thread
From: Matthieu Moy @ 2009-08-09 18:44 UTC
  To: Jerome Baum; +Cc: git

Jerome Baum <jerome@jeromebaum.com> writes:

> How difficult would it be to allow users to choose a hash function
> during git-init which is then globally used in the repo?

There's at least one really difficult thing: how do you merge two
projects using two different hash functions? The Git repository, for
example, has several (I don't remember how many) root commits, and
was originally made of several projects (git, gitk, git gui, ...). If
these projects had started using different hash functions, then,
either:

* Git would have needed to learn how to merge, and record the merge
  history, of projects using different hash functions.

* One of the projects would have been forced to be converted to
  another hash function, which means changing all its identifiers (so,
  for example, finding a message on gmane telling that commit 1ab23cde
  fixes your problem wouldn't help much anymore ...).

-- 
Matthieu
