git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Bill Zaumen <bill.zaumen+git@gmail.com>
To: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
Cc: Jeff King <peff@peff.net>,
	git@vger.kernel.org, gitster@pobox.com, spearce@spearce.org,
	torvalds@linux-foundation.org
Subject: Re: [PATCH] Implement fast hash-collision detection
Date: Wed, 30 Nov 2011 11:00:04 -0800	[thread overview]
Message-ID: <1322679604.1710.64.camel@yos> (raw)
In-Reply-To: <CACsJy8A6kGmn0h0xdxfTC4krXgc8hzO1fHTdqfk0YnASGN5K0w@mail.gmail.com>

[Will send a reply to Jeff's comment from last night with some 
clarifications and explanations later].

> What I'm thinking is whether it's possible to decouple two sha-1 roles
> in git, as object identifier and digest, separately. Each sha-1
> identifies an object and an extra set of digests on the "same" object.
> Object database is extended to store all these new digests and mapping
> between sha-1 and them. When we need to verify an object, given an
> sha-1, we rehash that object and check the result digest with the ones
> linked to the sha-1.

The patch I created (at least, a reasonable chunk of the code) kind of
does that:  it is very easy to change the CRC to whatever message digest
one wants.  I used a CRC primarily because I had the impression that
people were very concerned about speed, but it is easy to change that to
the message digest of your choice.  In any case, it might be a good
starting point if you want to try something in a different direction.

Basically, when you create a loose object, in addition to getting a
SHA-1 ID, you get a message digest that gets stored as well (in a
separate file). When you index a pack file, you get an IDX file
containing the SHA-1 ID  plus a corresponding MDS file containing the
message digest. Index-pack calculates the SHA-1 value from the object
stored in the pack file, and the (additional) message digest is computed
at the same time using the same data.  Commands like verify-pack check
both the IDX file and the MDS file for consistency with the matching
pack file.  The new message digest (the CRC in the patch) is used only
in cases where a repository is being altered (e.g., a loose object or
pack file is being created or a fetch, push, or pull operation) or some
explicit verification operation is running (e.g., git verify-pack).

Adding an additional header to the commit message is a good idea (I had
actually tried that, but something went wrong, although one of you
suggested what the problem might have been --- I can try again if there
is some interest in pursuing that).

It might be worth pointing out that you can use the SHA-1 hash of the
contents of objects (e.g., without the Git object header) as an
additional digest:  I tried a test using two 128-byte files with the
same MD5 hash, differing past the 20th byte, and deleted the first
four bytes of each.  With those bytes deleted, the hash collision
went away. I doubt if there is a known efficient algorithm that can
generate a hash collision for two files and for two other files that
differ from the first set by deleting N bytes from both.

  parent reply	other threads:[~2011-11-30 19:00 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <1322546563.1719.22.camel@yos>
2011-11-29  9:07 ` [PATCH] Implement fast hash-collision detection Jeff King
2011-11-29 10:24   ` Ævar Arnfjörð Bjarmason
2011-11-29 10:29     ` Jeff King
2011-11-29 13:17   ` Nguyen Thai Ngoc Duy
2011-11-29 15:23     ` Shawn Pearce
2011-11-29 14:04   ` Nguyen Thai Ngoc Duy
2011-11-29 20:59     ` Jeff King
2011-11-30 13:35       ` Nguyen Thai Ngoc Duy
2011-11-30 18:05         ` Junio C Hamano
2011-12-01  4:43           ` Nguyen Thai Ngoc Duy
2011-11-30 19:00         ` Bill Zaumen [this message]
2011-11-29 21:56   ` Bill Zaumen
2011-11-30  6:25     ` Jeff King
2011-12-01  0:41       ` Bill Zaumen
2011-12-01  5:26         ` Jeff King
2011-12-02  2:59           ` Bill Zaumen
2011-12-02 17:00             ` Jeff King
2011-11-29 17:08 ` Shawn Pearce
2011-11-29 22:05   ` Jeff King
2011-11-30  4:01   ` Bill Zaumen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1322679604.1710.64.camel@yos \
    --to=bill.zaumen+git@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    --cc=peff@peff.net \
    --cc=spearce@spearce.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).