Re: [RFC] adding support for md5

From: linux@horizon.com
To: rientjes@google.com
Cc: git@vger.kernel.org
Subject: Re: [RFC] adding support for md5
Date: 18 Aug 2006 23:19:31 -0400
Message-ID: <20060819031931.486.qmail@science.horizon.com>

This is a Very Dumb Idea.

I normally try to be polite, but this concept is particularly
deserving of scorn.

The idea that more choice is a good thing is sometimes seductive, but when
it comes to standards, that's not a good idea.  It's like the famous joke
told about the dumb inhabitants of your favorite ethnic region: they're
going to try driving on the other side of the road.  Starting next month,
all the cars will drive on the other side.  If that goes well, the month
after they'll add trucks and buses.

If some disaster arose that required Git to change hash functions,
it would be possible to build a conversion utility, although all the
signatures in tag objects would break; you'd have to regenerate them, too.

But this is an all-or-nothing change.  Trying to support *multiple*
simultaneous hash functions is a mess.

Git depends very fundamentally on identical objects having identical
hashes.  The in-memory merge depends on it.  The network protocol
depends on it.  Various kludges can be imagined to handle blob objects
with multiple names, but tree objects quickly become unworkable.

Let's break down the solutions.  There are basically four classes,
depending on
1) whether objects are stored in the database indexed by both hashes,
   or just one, and
2) whether pointers to objects include both hashes, or just one.

If you include both hashes everywhere, then you've just built
a larger hash function that's the concatenation of SHA-1 and MD5,
and while it works sanely, it just makes the object IDs even
bigger, and there are obviously no speed benefits.  But this is
the most reasonable alternative.
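
As a rough sketch of what "both hashes everywhere" amounts to (the name
combined_id is hypothetical, and this ignores Git's object headers):

```python
import hashlib

def combined_id(data: bytes) -> bytes:
    # Concatenating the two digests is just one bigger hash function:
    # 20 bytes of SHA-1 followed by 16 bytes of MD5.
    return hashlib.sha1(data).digest() + hashlib.md5(data).digest()

oid = combined_id(b"some object contents")
print(len(oid))  # 36 bytes instead of 20
```

Every object still gets exactly one name, which is why this variant stays
sane, but computing that name costs both hashes, so it cannot be faster
than SHA-1 alone.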

If pointers include both hashes, but objects are indexed by only one,
then to find an object by pointer requires two lookups, and you still
need to hash every blob twice when committing to get the values to
put in the tree objects.  So obviously no faster than the first option.

Okay, so pointers are only one hash.  If they were always the same hash,
the second hash would be utterly pointless, so we're assuming the
database contains a mix.

If objects are indexed by both hashes, then you can hash new blobs once
to check to see if they're already in the database, but if they're not,
you have to hash again with the other algorithm.

On the other hand, if you index by only one, then every object being
checked to see if it's already in the database needs to be hashed twice
so it can be looked up twice.  What were the claimed speed gains?
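
The hash counting for these last two cases can be sketched with a toy
in-memory "database".  Everything here is an assumption made up for
illustration (the dict/set layout, the function names, trying SHA-1
first); it only counts how many times the data gets hashed.

```python
import hashlib

ALGOS = ("sha1", "md5")

def digest(algo: str, data: bytes) -> bytes:
    return hashlib.new(algo, data).digest()

def add_dual_index(db: dict, data: bytes) -> int:
    """Database indexed by both hashes.  Returns the number of hashes computed."""
    key = ("sha1", digest("sha1", data))
    if key in db:
        return 1                                  # already stored: one hash sufficed
    db[key] = data
    db[("md5", digest("md5", data))] = data       # new object: must hash again
    return 2

def exists_single_index(keys: set, data: bytes):
    """Database indexed by one (arbitrary) hash per object.
    Returns (found, number of hashes computed)."""
    hashes = 0
    for algo in ALGOS:
        hashes += 1
        if (algo, digest(algo, data)) in keys:
            return True, hashes
    return False, hashes                          # a miss always costs both hashes

db = {}
print(add_dual_index(db, b"x"), add_dual_index(db, b"x"))

keys = {("md5", digest("md5", b"x"))}             # this object happened to be stored by MD5
print(exists_single_index(keys, b"x"), exists_single_index(keys, b"y"))
```

In this sketch the single-index lookup gets away with one hash only when
the object happens to be stored under whichever algorithm is tried first;
any new object, and any object stored under the other hash, pays for both.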

But more to the point, any system which stores one arbitrary hash as
a pointer makes the tree object created to describe a directory no
longer unique, which results in its hash being fundamentally non-unique,
which cascades all the way up to the commit object.

So you can get silly things like the need for a merge commit to
record the merge of trees that are actually identical.
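
A small sketch of the problem, assuming a tree entry is serialized the
way Git does it today ("<mode> <name>\0<raw hash>") and that a
mixed-hash tree would simply store whichever raw digest the writer
picked.  The MD5 variant is purely hypothetical, as is the helper name
tree_id.

```python
import hashlib

def tree_id(entries) -> str:
    """entries: iterable of (mode, name, raw_pointer) byte strings.
    The tree itself is named by SHA-1 here, for simplicity."""
    body = b"".join(
        mode + b" " + name + b"\0" + ptr
        for mode, name, ptr in sorted(entries, key=lambda e: e[1])
    )
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

# One file, one directory, byte-for-byte identical on both sides.
content = b"hello\n"
blob = b"blob " + str(len(content)).encode("ascii") + b"\0" + content

ptr_sha1 = hashlib.sha1(blob).digest()   # one writer points at the blob by SHA-1
ptr_md5 = hashlib.md5(blob).digest()     # another points at the same blob by MD5

tree_a = tree_id([(b"100644", b"README", ptr_sha1)])
tree_b = tree_id([(b"100644", b"README", ptr_md5)])
print(tree_a == tree_b)  # False: same directory, two different tree names
```

The two trees describe exactly the same directory, yet they serialize
differently and so get different names, and every commit built on top of
them inherits the difference.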

Just a big mess.

