Re: Suggestion on hashing

* Re: Suggestion on hashing
       [not found] <1322813319.4340.109.camel@yos>
@ 2011-12-02 14:22 ` Nguyen Thai Ngoc Duy
  2011-12-02 18:09   ` Jeff King
                     ` (2 more replies)
  2011-12-02 17:54 ` Jeff King
  1 sibling, 3 replies; 14+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-12-02 14:22 UTC
  To: Bill Zaumen; +Cc: Jeff King, Git Mailing List

(I'm not sure why you dropped git@vger. I see nothing private here, so
I'm bringing git@vger back.)

On Fri, Dec 2, 2011 at 3:08 PM, Bill Zaumen <bill.zaumen@gmail.com> wrote:
> At one point Nguyen said that "What I'm thinking is whether it's
> possible to decouple two sha-1 roles in git, as object identifier
> and digest, separately. Each sha-1 identifies an object and an extra
> set of digests on the "same" object."
>
> My code pretty much does that (it just uses a CRC instead of a real
> digest, but I can easily change that).

It'd be easier to look at your code if you split it into a series of
smaller patches.

> So the question is whether
> using SHA-1 as an ID and SHA-256(?) as a digest is a better long term
> solution than simply replacing SHA-1.

I would not stick with any algorithm permanently. No one knows when
SHA-256 might be broken.

> If there is some interest in pursuing it further, I could make those
> changes fairly easily.  Then you'd have two message digests, a SHA-1
> and a longer one, with the longer one stored parallel to the actual
> object. Then it becomes easy to compute a digest of all the digests
> in a commit's tree and store that in a commit, if that is what you
> want to do.

I personally would like to see how it works out, especially when
computing the new digests is much more expensive than computing SHA-1.
I also hope that by delaying the computation of new digests (stored
outside the actual objects), we could keep code changes to git minimal.
Security concerns may be the deciding factor, though, and I haven't
worked that out yet.

> Replacing SHA-1 with something like SHA-256 sounds easier to implement,

SHA-1 characteristics (like the 20-byte length) are hard-coded everywhere
in git; it'd be a big audit.

> but the problem is all the existing repositories.  While rewriting all
> the objects and trees to use new hashes is similar to a rebase in most
> cases, there is a complication - submodules.  Git stores the hash of
> a submodule's commit in its tree because a particular revision of
> a project 'goes' with a particular revision of a submodule. But, a
> submodule can exist in one revision and not in the next or previous
> revision.  Furthermore, A could be a submodule of B at one point in time,
> and many commits later, B could end up being a submodule of A.
> Fixing it up could be pretty complicated (plus having to deal with
> network failures - to update GitHub for example, you'd have to download
> submodules it uses, possibly from somewhere else, and some submodules may
> not be publicly accessible, e.g., a private project kept on GitHub but
> with a critical submodule kept in house behind a corporate firewall).
> Also, you might have to update a git repository and its submodules
> concurrently, so that you always can find a new value when you need
> it.
>
> My guess is that this could be far more complicated than what I did.
> Excluding two files that are not used (the symbol PACKDB is not
> defined), I added two new files, crcdb.h and objd-crcdb.c which store
> CRCs for loose objects - 517 lines total including lots of comments in
> the header file - full documentation for each function.  The other
> changes include 1475 lines of new code in previously existing git files
> and 136 deletions (most trivial).  There were also minor changes to
> the makefile and test scripts.

You'd need to convince the git maintainer this is worth doing first,
before talking about how big the changes are ;-)

> Bill
-- 
Duy

* Re: Suggestion on hashing
       [not found] <1322813319.4340.109.camel@yos>
  2011-12-02 14:22 ` Suggestion on hashing Nguyen Thai Ngoc Duy
@ 2011-12-02 17:54 ` Jeff King
  2011-12-03  1:50   ` Bill Zaumen
  1 sibling, 1 reply; 14+ messages in thread
From: Jeff King @ 2011-12-02 17:54 UTC
  To: Bill Zaumen; +Cc: git, pclouds

On Fri, Dec 02, 2011 at 12:08:39AM -0800, Bill Zaumen wrote:

> At one point Nguyen said that "What I'm thinking is whether it's
> possible to decouple two sha-1 roles in git, as object identifier
> and digest, separately. Each sha-1 identifies an object and an extra
> set of digests on the "same" object."
> 
> My code pretty much does that (it just uses a CRC instead of a real
> digest, but I can easily change that).   So the question is whether
> using SHA-1 as an ID and SHA-256(?) as a digest is a better long term
> solution than simply replacing SHA-1.

I think your code is solving the wrong problem (or solving the right
problem in a half-way manner). The only things that make sense to me
are:

  1. Do nothing. SHA-1 is probably not broken yet, even by the NSA, and
     even if it is, an attack is extremely expensive to mount. This may
     change in the future, of course, but it will probably stay
     expensive for a while.

  2. Decouple the object identifier and digest roles, but insert the
     digest into newly created objects, so it can be part of the
     signature chain. I described such a scheme in one of my replies to
     you. It has some complexities, but has the bonus that we can build
     directly on older history, preserving its sha1s.

  3. Replace SHA-1 with a more secure algorithm.

I'm probably in favor of (1) at this point. Whether to do (2) or (3)
will depend on where we are when SHA-1 gets feasibly broken. It may be
many years away, at which point we may be considering a git 2.0 that
breaks repository compatibility, anyway. That would be a natural time to
consider changing the algorithm.

> Replacing SHA-1 with something like SHA-256 sounds easier to implement,
> but the problem is all the existing repositories.

Right. I don't think anyone is denying that it would be a giant pain.

-Peff

* Re: Suggestion on hashing
  2011-12-02 14:22 ` Suggestion on hashing Nguyen Thai Ngoc Duy
@ 2011-12-02 18:09   ` Jeff King
  2011-12-03  0:48   ` Bill Zaumen
  2011-12-06  1:56   ` Chris West (Faux)
  2 siblings, 0 replies; 14+ messages in thread
From: Jeff King @ 2011-12-02 18:09 UTC
  To: Nguyen Thai Ngoc Duy; +Cc: Bill Zaumen, Git Mailing List

On Fri, Dec 02, 2011 at 09:22:31PM +0700, Nguyen Thai Ngoc Duy wrote:

> > So the question is whether
> > using SHA-1 as an ID and SHA-256(?) as a digest is a better long term
> > solution than simply replacing SHA-1.
> 
> I would not stick with any algorithm permanently. No one knows when
> SHA-256 might be broken.

Yeah, you could stick a few bits of algorithm parameter in the beginning
of each identifier. It would mean unique hashes get one character or so
longer (and they would all start with "1", or whatever the identifier
is).

SHA-256 doesn't suffer from SHA-1's problems, though they are based on
related constructions, so I think there is some concern that it may
eventually fail in the same way. SHA-3 is a better bet in that sense,
but it will also be very unproven, even once it is actually
standardized.

> > Replacing SHA-1 with something like SHA-256 sounds easier to implement,
> 
> SHA-1 characteristics (like 20 byte length) are hard coded everywhere
> in git, it'd be a big audit.

In theory, you could truncate a longer hash to 160 bits. It's not the
bit-strength of SHA-1 that is the problem, but the attacks on the
algorithm itself which reduce the bit-strength to something too low.
I would think a truncated result would retain the same cryptographic
properties, as one of the properties of the un-truncated hash is that
changes in the input data are reflected throughout the hash. Some
hashes, like Skein, explicitly have a big internal state, and then just
let you output as many bytes as is appropriate (i.e., being a drop-in
replacement for SHA-1 is an explicit goal).

But I'm not a cryptographer, so there may be some subtle issues with
doing that to arbitrary hash functions.
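
As a rough illustration of the two ideas above (an algorithm tag at the
front of each identifier, and truncating a longer hash to 160 bits), here
is a minimal Python sketch. The tag values and the names tagged_id and
verify are assumptions made up for illustration; nothing like this exists
in git, and whether truncation is safe for a given hash function is
exactly the open question mentioned above.

```python
import hashlib

# Hypothetical one-character algorithm tags; not an actual git format.
ALGORITHMS = {"0": "sha1", "1": "sha256"}
ID_BYTES = 20  # 160 bits, the length of a SHA-1 digest


def tagged_id(tag: str, data: bytes) -> str:
    """Hash data with the tagged algorithm, keep 160 bits, prefix the tag."""
    digest = hashlib.new(ALGORITHMS[tag], data).digest()
    return tag + digest[:ID_BYTES].hex()


def verify(identifier: str, data: bytes) -> bool:
    """Recompute the identifier with the algorithm named by its first character."""
    return tagged_id(identifier[0], data) == identifier
```

Every identifier comes out 41 hex characters long, one more than today,
and a reader can tell from the first character which algorithm to use
when re-checking an object.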

-Peff

* Re: Suggestion on hashing
  2011-12-02 14:22 ` Suggestion on hashing Nguyen Thai Ngoc Duy
  2011-12-02 18:09   ` Jeff King
@ 2011-12-03  0:48   ` Bill Zaumen
  2011-12-06  1:56   ` Chris West (Faux)
  2 siblings, 0 replies; 14+ messages in thread
From: Bill Zaumen @ 2011-12-03  0:48 UTC
  To: Nguyen Thai Ngoc Duy; +Cc: Jeff King, Git Mailing List

On Fri, 2011-12-02 at 21:22 +0700, Nguyen Thai Ngoc Duy wrote:
> (I'm not sure why you dropped git@vger. I see nothing private here so
> I bring git@vger back)

Oh, I just didn't want to flood the mailing list with too much on
one topic and figured we could summarize a discussion at some point
and post that, but if you'd rather keep it all on the list, that's 
fine with me.

I can split the code into a series of smaller patches - smaller than
the set of three I sent, but I'm not sure if the test scripts will work
with all of the intermediate patches if I do that.

I can also make the digest (currently a CRC) pluggable.  Then you can try
different digests as an experiment and see how that affects performance.
My implementation uses the CRC or new digests only when
the object database is being modified or explicitly verified. Basically
the code provides memoization for an additional hash function, used
for whatever purpose you desire.

If you want to put a digest of message digests into a commit message,
you can do that fairly quickly as one level of digests has been
precomputed. I think Jeff's or your suggestion of putting an additional
digest in the commit message is a good idea.  If you want to experiment
with such changes, the code would provide a reasonable start on that.

So, I guess I should make those changes - pluggable digest and 
splitting the patches further.

> You'd need to convince git maintainer this is worth doing first,
> before talking how big the changes are ;-)

I'd guess there are several issues: the amount of code, how complex
the changes are, what the performance impacts are, whether the changes
are backwards compatible, and what you get for the effort.

As a start on the last question, "what you get," aside from some extra
checking to detect problems, if you modify commit messages and signed
tags to use better digests, you can make a stronger argument regarding
authentication.  For example, suppose you have a project in which your
code is dual-licensed (GPL for free use, but a separate license if the
code is used in a proprietary product) and there is a legal dispute.
Using a better digest than SHA-1 would have some advantages: when they
start calling in expert witnesses, one side will bring in a security
expert who will testify that SHA-1 is too weak to be used for
authentication, citing government publications such as
http://csrc.nist.gov/groups/ST/hash/statement.html as evidence. The
jury is not going to consist of people who can fully understand the
details, so being able to say that git's authentication matches current
best practices would be an additional reason to use git.

* Re: Suggestion on hashing
  2011-12-02 17:54 ` Jeff King
@ 2011-12-03  1:50   ` Bill Zaumen
  2011-12-03 15:08     ` Jeff King
  0 siblings, 1 reply; 14+ messages in thread
From: Bill Zaumen @ 2011-12-03  1:50 UTC
  To: Jeff King; +Cc: git, pclouds

On Fri, 2011-12-02 at 12:54 -0500, Jeff King wrote:
> On Fri, Dec 02, 2011 at 12:08:39AM -0800, Bill Zaumen wrote:

> I think your code is solving the wrong problem (or solving the right
> problem in a half-way manner). The only things that make sense to me
> are:
> 
>   1. Do nothing. SHA-1 is probably not broken yet, even by the NSA, and
>      even if it is, an attack is extremely expensive to mount. This may
>      change in the future, of course, but it will probably stay
>      expensive for a while.
> 
>   2. Decouple the object identifier and digest roles, but insert the
>      digest into newly created objects, so it can be part of the
>      signature chain. I described such a scheme in one of my replies to
>      you. It has some complexities, but has the bonus that we can build
>      directly on older history, preserving its sha1s.
> 
>   3. Replace SHA-1 with a more secure algorithm.

Suppose I make the digest pluggable, something I intended to do
eventually anyway?  Then you just use the existing SHA-1 as an
object identifier and the new digest in a signature chain?  What I
did was essentially to compute the new digest (using a CRC as the
trivial case) whenever an object's SHA-1 hash is computed, plus
using the new digest for low-cost collision checks.

Then you have everything needed to experiment with your second option.
I got the impression that Nguyen had some interest in that, but could
be mistaken.

The use is simple: if you have the SHA-1 hash of an object, you call
a function, currently named "has_sha1_file_crc" and it returns true if
a CRC is available, writing the hash into the buffer supplied as its
second argument.  You can do whatever you like with it.  If you want
a digest of digests, you just traverse a commit's tree, and call
has_sha1_file_crc whenever you want to look up a digest.  So, the API
is actually very simple if you just use the patch to quickly look up
the digest associated with a SHA-1 ID - everything else it does happens
automatically.
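
The lookup described above can be sketched in a few lines of Python. This
is only a model of the idea, not the patch: digest_db, store_object,
lookup_digest and digest_of_digests are hypothetical names, the side
database is a plain dictionary, and objects are hashed as raw bytes
without git's object header. CRC-32 stands in for the pluggable digest.

```python
import hashlib
import zlib

# Hypothetical in-memory stand-in for the side database of extra digests.
# Keys are SHA-1 object IDs (hex); values are the memoized extra digest.
digest_db = {}


def extra_digest(data: bytes) -> bytes:
    """Pluggable digest; CRC-32 here as the trivial placeholder."""
    return zlib.crc32(data).to_bytes(4, "big")


def store_object(data: bytes) -> str:
    """Compute the SHA-1 ID and memoize the extra digest at the same time."""
    object_id = hashlib.sha1(data).hexdigest()
    digest_db[object_id] = extra_digest(data)
    return object_id


def lookup_digest(object_id: str):
    """Rough analogue of has_sha1_file_crc: the stored digest, or None."""
    return digest_db.get(object_id)


def digest_of_digests(object_ids) -> str:
    """Combine the memoized digests of a tree's objects without rehashing them."""
    combined = hashlib.sha256()
    for object_id in sorted(object_ids):
        digest = lookup_digest(object_id)
        if digest is None:
            raise KeyError(f"no digest recorded for {object_id}")
        combined.update(bytes.fromhex(object_id) + digest)
    return combined.hexdigest()
```

The point of the memoization is that digest_of_digests never touches the
object contents; it only reads the table that was filled in when each
SHA-1 was computed.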

* Re: Suggestion on hashing
  2011-12-03  1:50   ` Bill Zaumen
@ 2011-12-03 15:08     ` Jeff King
  2011-12-03 15:34       ` Philip Oakley
  2011-12-03 21:21       ` Bill Zaumen
  0 siblings, 2 replies; 14+ messages in thread
From: Jeff King @ 2011-12-03 15:08 UTC
  To: Bill Zaumen; +Cc: git, pclouds

On Fri, Dec 02, 2011 at 05:50:21PM -0800, Bill Zaumen wrote:

> On Fri, 2011-12-02 at 12:54 -0500, Jeff King wrote:
> > On Fri, Dec 02, 2011 at 12:08:39AM -0800, Bill Zaumen wrote:
> 
> > I think your code is solving the wrong problem (or solving the right
> > problem in a half-way manner). The only things that make sense to me
> > are:
> > 
> >   1. Do nothing. SHA-1 is probably not broken yet, even by the NSA, and
> >      even if it is, an attack is extremely expensive to mount. This may
> >      change in the future, of course, but it will probably stay
> >      expensive for a while.
> > 
> >   2. Decouple the object identifier and digest roles, but insert the
> >      digest into newly created objects, so it can be part of the
> >      signature chain. I described such a scheme in one of my replies to
> >      you. It has some complexities, but has the bonus that we can build
> >      directly on older history, preserving its sha1s.
> > 
> >   3. Replace SHA-1 with a more secure algorithm.
> 
> Suppose I make the digest pluggable, something I intended to do
> eventually anyway?  Then you just use the existing SHA-1 as an
> object identifier and the new digest in a signature chain?  What I
> did was essentially to compute the new digest (using a CRC as the
> trivial case) whenever an object's SHA-1 hash is computed, plus
> using the new digest for low-cost collision checks.

If you make the digest stronger (or pluggable) and include it in the
actual objects themselves, then you have a start on (2).

I'd drop all of the digest-exchange bits from the protocol, as the
actual signatures are the real, trustable verification. I don't think
you can drop the external storage of the digests, which is one of the
ugliest bits. You'll be asking for the digests all the time to create
new commit objects, so you need to have it at hand without rehashing.

And I wouldn't get my hopes up that this will go into git any time soon.
At this point, we're really guessing about how broken SHA-1 will be in
the future, and how much we are going to want to care.

Just my two cents.

-Peff

* Re: Suggestion on hashing
  2011-12-03 15:08     ` Jeff King
@ 2011-12-03 15:34       ` Philip Oakley
  2011-12-03 21:21       ` Bill Zaumen
  1 sibling, 0 replies; 14+ messages in thread
From: Philip Oakley @ 2011-12-03 15:34 UTC
  To: Bill Zaumen; +Cc: git, pclouds, Jeff King

Have you seen the recent thread by Junio with the footnote link to the
paper on reconciliation by using multiple hashes?
http://article.gmane.org/gmane.linux.kernel/1214517
"What's the Difference? Efficient Set Reconciliation without Prior
Context" http://cseweb.ucsd.edu/~fuyeda/papers/sigcomm2011.pdf
It looks to have a lot of the properties being sought, and links with
other git aspects.

Philip


* Re: Suggestion on hashing
  2011-12-03 15:08     ` Jeff King
  2011-12-03 15:34       ` Philip Oakley
@ 2011-12-03 21:21       ` Bill Zaumen
  1 sibling, 0 replies; 14+ messages in thread
From: Bill Zaumen @ 2011-12-03 21:21 UTC
  To: Jeff King; +Cc: git, pclouds

On Sat, 2011-12-03 at 10:08 -0500, Jeff King wrote:

> > 
> > Suppose I make the digest pluggable, something I intended to do
> > eventually anyway?  Then you just use the existing SHA-1 as an
> > object identifier and the new digest in a signature chain?  What I
> > did was essentially to compute the new digest (using a CRC as the
> > trivial case) whenever an object's SHA-1 hash is computed, plus
> > using the new digest for low-cost collision checks.
> 
> If you make the digest stronger (or pluggable) and include it in the
> actual objects themselves, then you have a start on (2).
> 
> I'd drop all of the digest-exchange bits from the protocol, as the
> actual signatures are the real, trustable verification. I don't think
> you can drop the external storage of the digests, which is one of the
> ugliest bits. You'll be asking for the digests all the time to create
> new commit objects, so you need to have it at hand without rehashing.

The digest-exchange bits, including the tests and response to errors,
are only 222 lines of new code, so it's really a minor part.  The rest
takes care of what you referred to as "one of the ugliest bits," so
I think it is useful to have available - you can then try various ways
of improving the authentication of commit objects without having to do
a lot of initial work.

I can make those changes, probably over the next couple of weeks or
so (I have some other unrelated things to take care of), and then send
a new set of patches.

> 
> And I wouldn't get my hopes up that this will go into git any time soon.
> At this point, we're really guessing about how broken SHA-1 will be in
> the future, and how much we are going to want to care.
> 
> Just my two cents.

Thanks for the discussion.  I might add that it is not just a question
of how broken SHA-1 is.  If an IT department is considering adopting Git
as the company's revision control system and authentication is important
to the company, an IT manager may not accept SHA-1 for authentication
purposes because NIST claims SHA-1 is not adequate for authentication in
general and explaining to upper management why NIST's statement is not
applicable given the way SHA-1 is used in Git is much harder than
saying, "Git follows the current best practices regarding
authentication."  That statement is a simple check-list item one can
show upper management in comparing alternatives.

Such issues (making technical choices for non-technical reasons) have
come up before - I once worked on a high-speed (for the time) networking
project and our manager mentioned that transferring medical records such
as X-ray pictures was one application - they do not accept lossy data
compression because, even if it is completely adequate, in a malpractice
suit, the plaintiff's lawyer would say, "And they purposely threw away
data critical to my client's health," which would sound pretty damning
to a typical jury.  The legal risk outweighed the cost of the additional
bandwidth.

* Re: Suggestion on hashing
  2011-12-02 14:22 ` Suggestion on hashing Nguyen Thai Ngoc Duy
  2011-12-02 18:09   ` Jeff King
  2011-12-03  0:48   ` Bill Zaumen
@ 2011-12-06  1:56   ` Chris West (Faux)
  2011-12-06  3:47     ` Bill Zaumen
  2011-12-06  4:46     ` Nguyen Thai Ngoc Duy
  2 siblings, 2 replies; 14+ messages in thread
From: Chris West (Faux) @ 2011-12-06  1:56 UTC
  To: Nguyen Thai Ngoc Duy; +Cc: Bill Zaumen, Jeff King, Git Mailing List

Nguyen Thai Ngoc Duy wrote:
> SHA-1 characteristics (like 20 byte length) are hard coded everywhere
> in git, it'd be a big audit.

I was planning to look at this anyway.  My branch[1] allows
  init/add/commit with SHA-256, SHA-512 and all the SHA-3 candidates.

log/fsck/etc. are all broken.  Don't even dare try packs.  Fixing things
  is painful but not impossible.  I'm not convinced the task is even
  remotely insurmountable.

(This is not a request-for-comments, just an informational notification.
  It does not even attempt to address compatibility or the like.)

$ make HASH=sha512 -j6
$ PATH=bin-wrappers:..
$ git init && echo hi > foo && git add foo && git commit -m "bang"
Initialized empty Git repository in /.../.git/
[master (root-commit) 8d3ae658dff0c6e398bb4a0d193974e49acfadedfcd61daca42c931ac18d5ac46f0a068e08d81c25d7b79b1c3f4951e4340eeb90f0ef39de355c9bab7e75faba] bang
  1 files changed, 1 insertions(+), 0 deletions(-)
    create mode 100644 foo

1. (Please use the hash-v0.0.1 tag, I rebase.)
   gitweb: http://preview.tinyurl.com/bsufh92
   git://git.goeswhere.com/git/git.git
   https://github.com/FauxFaux/git/tree/hash-v0.0.1

---
Chris West (Faux)
Freenode #git: FauxFaux
https://ssl.goeswhere.com/key-transition-2011-10-10.txt.asc
gpg: 408A E4F1 4EA7 33EF 1265  82C1 B195 E1C4 779B A9B2

* Re: Suggestion on hashing
  2011-12-06  1:56   ` Chris West (Faux)
@ 2011-12-06  3:47     ` Bill Zaumen
  2011-12-06  4:46     ` Nguyen Thai Ngoc Duy
  1 sibling, 0 replies; 14+ messages in thread
From: Bill Zaumen @ 2011-12-06  3:47 UTC
  To: Chris West (Faux); +Cc: Nguyen Thai Ngoc Duy, Jeff King, Git Mailing List

When I went through the code, I noted that SHA-1 hashes are
currently used for the following:

   * object IDs
   * authentication (something to sign using public-key encryption)
   * data integrity (basically a really good checksum).

While there are a lot of 20-byte arrays of unsigned char, many of those
are associated with lookups.  You might want to look at the
number of places that git_SHA1_Init is called (there aren't all that
many of those, and that function indicates the points where SHA-1
hashes are being created).
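
For the object-ID role specifically, the ID of a blob is the SHA-1 of a
short header ("blob", a space, the content length, and a NUL byte)
followed by the content. The Python sketch below reproduces that
calculation and takes the algorithm as a parameter, purely to show what
swapping the hash means for IDs. The function name git_blob_id is made
up, and this is not how any of the patches discussed in this thread are
structured.

```python
import hashlib


def git_blob_id(content: bytes, algorithm: str = "sha1") -> str:
    """Hash the git blob header plus content with the chosen algorithm."""
    header = b"blob %d\0" % len(content)
    return hashlib.new(algorithm, header + content).hexdigest()
```

With the default algorithm, git_blob_id(b"") gives
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the familiar ID of the empty
blob. Passing "sha512" instead yields a 128-character ID, which is where
all the 20-byte assumptions start to matter.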

While a few things I tried were complete false starts (kept those
out of the preliminary patches I sent), I managed to store
a CRC (which you can treat as a place-holder for a real message
digest) for each SHA-1 hash in a pack file, but I did it by
creating a separate file (extension ".mds") and that worked.
I looked into modifying pack files, and that was too messy given
that you'd want older versions to still work with newer remote
repositories.  The other factor is that the "mds" files are
computed locally, and at the same time that you create an "idx" file.
The formats of the "pack" and "idx" files don't change.

I've just started on replacing the CRC I used with real message
digests, making new digests easy to add. The plan is to initially
make it work with both a CRC and SHA-1: the CRC so I can test it
easily by comparing new and old versions to show that nothing
changed when it shouldn't have, and SHA-1 because Git already
implements it.

I should complete my changes.  If we are lucky, maybe the changes I'm
trying would solve some of the problems you mentioned with pack files.
At least I can store the digests in a way that doesn't break the log
and fsck operations (it went through all the test suites, with only
minor modifications for things like counting the number of files in
particular directories).

If you make changes to commit objects, fixing the test scripts is a 
pain - there are a number of places where SHA-1 values are hard-
coded, and those have to be replaced.

Bill

On Tue, 2011-12-06 at 01:56 +0000, Chris West (Faux) wrote:
> Nguyen Thai Ngoc Duy wrote:
> > SHA-1 characteristics (like 20 byte length) are hard coded everywhere
> > in git, it'd be a big audit.
> 
> I was planning to look at this anyway.  My branch[1] allows
>   init/add/commit with SHA-256, SHA-512 and all the SHA-3 candidates.
> 
> log/fsck/etc. are all broken.  Don't even dare try packs.  Fixing things
>   is painful but not impossible.  I'm not convinced the task is even
>   remotely insurmountable.
> 
> (This is not a request-for-comments, just an informational notification.
>   It does not even attempt to address compatibility or the like.)

* Re: Suggestion on hashing
  2011-12-06  1:56   ` Chris West (Faux)
  2011-12-06  3:47     ` Bill Zaumen
@ 2011-12-06  4:46     ` Nguyen Thai Ngoc Duy
  2011-12-06  6:02       ` Bill Zaumen
  1 sibling, 1 reply; 14+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-12-06  4:46 UTC
  To: Chris West (Faux); +Cc: Bill Zaumen, Jeff King, Git Mailing List

On Tue, Dec 6, 2011 at 8:56 AM, Chris West (Faux) <faux@goeswhere.com> wrote:
>
> Nguyen Thai Ngoc Duy wrote:
>>
>> SHA-1 characteristics (like 20 byte length) are hard coded everywhere
>> in git, it'd be a big audit.
>
>
> I was planning to look at this anyway.  My branch[1] allows
>  init/add/commit with SHA-256, SHA-512 and all the SHA-3 candidates.

Great!

> log/fsck/etc. are all broken.  Don't even dare try packs.  Fixing things
>  is painful but not impossible.  I'm not convinced the task is even
>  remotely insurmountable.

It would take more work, but after you're done with code changes, you
should have a look at updating the test suite. We have many SHA-1s
there. If the test suite passes, your job is (beautifully) done.
-- 
Duy

* Re: Suggestion on hashing
  2011-12-06  4:46     ` Nguyen Thai Ngoc Duy
@ 2011-12-06  6:02       ` Bill Zaumen
  2011-12-06  6:23         ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 14+ messages in thread
From: Bill Zaumen @ 2011-12-06  6:02 UTC
  To: Nguyen Thai Ngoc Duy; +Cc: Chris West (Faux), Jeff King, Git Mailing List

On Tue, 2011-12-06 at 11:46 +0700, Nguyen Thai Ngoc Duy wrote:
> On Tue, Dec 6, 2011 at 8:56 AM, Chris West (Faux) <faux@goeswhere.com> wrote:
> >
> > Nguyen Thai Ngoc Duy wrote:
> >>
> >> SHA-1 characteristics (like 20 byte length) are hard coded everywhere
> >> in git, it'd be a big audit.
> >
> >
> > I was planning to look at this anyway.  My branch[1] allows
> >  init/add/commit with SHA-256, SHA-512 and all the SHA-3 candidates.
> 
> Great!

If you are replacing SHA-1 as an object ID with another hash function,
two things to watch are submodules and alternative object databases.
Because of those, it is necessary to worry about the order in which
repositories are converted.  In the worst case for submodules, you'd
have to do multiple repositories at the same time, switching between
them depending on what you need at each point.

* Re: Suggestion on hashing
  2011-12-06  6:02       ` Bill Zaumen
@ 2011-12-06  6:23         ` Nguyen Thai Ngoc Duy
  2011-12-07  1:44           ` Bill Zaumen
  0 siblings, 1 reply; 14+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-12-06  6:23 UTC
  To: Bill Zaumen; +Cc: Chris West (Faux), Jeff King, Git Mailing List

On Tue, Dec 6, 2011 at 1:02 PM, Bill Zaumen <bill.zaumen@gmail.com> wrote:
> On Tue, 2011-12-06 at 11:46 +0700, Nguyen Thai Ngoc Duy wrote:
>> On Tue, Dec 6, 2011 at 8:56 AM, Chris West (Faux) <faux@goeswhere.com> wrote:
>> >
>> > Nguyen Thai Ngoc Duy wrote:
>> >>
>> >> SHA-1 characteristics (like 20 byte length) are hard coded everywhere
>> >> in git, it'd be a big audit.
>> >
>> >
>> > I was planning to look at this anyway.  My branch[1] allows
>> >  init/add/commit with SHA-256, SHA-512 and all the SHA-3 candidates.
>>
>> Great!
>
> If you are replacing SHA-1 as an object ID with another hash function,
> two things to watch are submodules and alternative object databases.
> Because of those, it is necessary to worry about the order in which
> repositories are converted.  In the worst case for submodules, you'd
> have to do multiple repositories at the same time, switching between
> them depending on what you need at each point.

I know migration would be painful. But note that new repos can benefit
from a stronger digest without legacy baggage (until they link to an old
repo, of course). For submodules, I think we should extend the entry to
become something similar to a soft link: the git link is the SHA-1 of a
text file that contains the SHA-1 and maybe other digests of the
submodule's tip.
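
To make the soft-link idea a little more concrete, here is a minimal
Python sketch. The file format (one "name value" pair per line) and the
names write_link, parse_link and link_id are pure assumptions for
illustration; no such format exists in git, and the link file is hashed
as raw bytes rather than as a real git object.

```python
import hashlib


def write_link(digests: dict) -> bytes:
    """Serialize the submodule tip's digests, one 'name value' pair per line."""
    lines = [f"{name} {value}" for name, value in sorted(digests.items())]
    return ("\n".join(lines) + "\n").encode("ascii")


def parse_link(text: bytes) -> dict:
    """Read the link file back into a mapping of digest name to hex value."""
    result = {}
    for line in text.decode("ascii").splitlines():
        name, value = line.split(" ", 1)
        result[name] = value
    return result


def link_id(text: bytes) -> str:
    """The SHA-1 a tree entry would store, pointing at the link file."""
    return hashlib.sha1(text).hexdigest()
```

The tree entry stays a 20-byte SHA-1, so old tree parsing keeps working,
while a newer git could follow the link and check the stronger digest of
the submodule's tip.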
-- 
Duy

* Re: Suggestion on hashing
  2011-12-06  6:23         ` Nguyen Thai Ngoc Duy
@ 2011-12-07  1:44           ` Bill Zaumen
  0 siblings, 0 replies; 14+ messages in thread
From: Bill Zaumen @ 2011-12-07  1:44 UTC
  To: Nguyen Thai Ngoc Duy; +Cc: Chris West (Faux), Jeff King, Git Mailing List

On Tue, 2011-12-06 at 13:23 +0700, Nguyen Thai Ngoc Duy wrote:
> On Tue, Dec 6, 2011 at 1:02 PM, Bill Zaumen <bill.zaumen@gmail.com> wrote:

> > If you are replacing SHA-1 as an object ID with another hash function,
> > two things to watch are submodules and alternative object databases.
> > Because of those, it is necessary to worry about the order in which
> > repositories are converted.  In the worst case for submodules, you'd
> > have to do multiple repositories at the same time, switching between
> > them depending on what you need at each point.
> 
> I know migration would be painful. But note that new repos can benefit
> stronger digest without legacy (of course until it links to an old
> repo). For submodules, I think we should extend it to become something
> similar to soft-link: git link is an SHA-1 to a text file that
> contains SHA-1 and maybe other digests of the submodule's tip.

Repositories would need to store a table mapping old SHA-1 values to
the new ones (for commits).  There's nothing in a repository to
reliably indicate that it is being used as a submodule, and the choice
of submodules can vary from commit to commit, making it difficult to
control the order in which objects have their hashes updated.  In some
corner cases, you could have two branches in each of two repositories
with different choices as to which is a submodule of which, although
I'd be surprised if anyone actually did that.
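
A small Python sketch of why the ordering matters. Commits are modeled
here as (old_id, parent_ids, message) tuples and hashed over a simplified
body; commit_id and convert_history are hypothetical names, not git
functions. The conversion only works if every commit a given commit
refers to has already been converted and entered in the table.

```python
import hashlib


def commit_id(algorithm: str, parents, message: str) -> str:
    """Hash a simplified commit body: parent lines, a blank line, the message."""
    body = "".join(f"parent {p}\n" for p in parents) + "\n" + message
    return hashlib.new(algorithm, body.encode()).hexdigest()


def convert_history(commits) -> dict:
    """Map old SHA-1 commit IDs to new SHA-256 IDs.

    commits must list parents before children; a reference to a commit
    that has not been converted yet raises KeyError.
    """
    mapping = {}
    for old_id, parents, message in commits:
        new_parents = [mapping[p] for p in parents]
        mapping[old_id] = commit_id("sha256", new_parents, message)
    return mapping
```

Within one repository the parent-first order is easy to arrange. A
submodule reference is the same kind of lookup, except that the table
entry lives in a different repository, which is where the ordering
problem comes from.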

Aside from that, in some corporate environments, the IT departments
want to determine the release schedule for applications, and would
take a dim view of changes that could not be tested first without being
widely deployed.  You could end up making Git unacceptable for those
departments if you do not maintain backwards compatibility with
existing repositories.

end of thread, other threads:[~2011-12-07  1:44 UTC | newest]

Thread overview: 14+ messages
     [not found] <1322813319.4340.109.camel@yos>
2011-12-02 14:22 ` Suggestion on hashing Nguyen Thai Ngoc Duy
2011-12-02 18:09   ` Jeff King
2011-12-03  0:48   ` Bill Zaumen
2011-12-06  1:56   ` Chris West (Faux)
2011-12-06  3:47     ` Bill Zaumen
2011-12-06  4:46     ` Nguyen Thai Ngoc Duy
2011-12-06  6:02       ` Bill Zaumen
2011-12-06  6:23         ` Nguyen Thai Ngoc Duy
2011-12-07  1:44           ` Bill Zaumen
2011-12-02 17:54 ` Jeff King
2011-12-03  1:50   ` Bill Zaumen
2011-12-03 15:08     ` Jeff King
2011-12-03 15:34       ` Philip Oakley
2011-12-03 21:21       ` Bill Zaumen
