Re: Is GIT_DEFAULT_HASH flawed?

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Felipe Contreras <felipe.contreras@gmail.com>
To: demerphq <demerphq@gmail.com>,
	Felipe Contreras <felipe.contreras@gmail.com>
Cc: Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com>,
	"brian m. carlson" <sandals@crustytoothpaste.net>,
	git@vger.kernel.org, Adam Majer <adamm@zombino.com>
Subject: Re: Is GIT_DEFAULT_HASH flawed?
Date: Wed, 03 May 2023 12:20:45 -0600	[thread overview]
Message-ID: <6452a5fdf2bd9_6822945f@chronos.notmuch> (raw)
In-Reply-To: <CANgJU+UasufF7-B8ukEMm_Lv8gu4wUpaVKa9AOBacDHJvi7fxQ@mail.gmail.com>

demerphq wrote:
> On Wed, 3 May 2023 at 02:17, Felipe Contreras
> <felipe.contreras@gmail.com> wrote:
> > Changing the subject as this message seems like a different topic.
> > Jeff King wrote:
> > > On Wed, Apr 26, 2023 at 02:33:30PM -0700, Junio C Hamano wrote:
> > > > "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> > > >
> > > > >  `GIT_DEFAULT_HASH`::
> > > > >   If this variable is set, the default hash algorithm for new
> > > > >   repositories will be set to this value. This value is currently
> > > > > + ignored when cloning if the remote value can be definitively
> > > > > + determined; the setting of the remote repository is used
> > > > > + instead. The value is honored if the remote repository's
> > > > > + algorithm cannot be determined, such as some cases when
> > > > > + the remote repository is empty. The default is "sha1".
> > > > > + THIS VARIABLE IS EXPERIMENTAL! See `--object-format`
> > > > > + in linkgit:git-init[1].
> > > >
> > > > We'd need to evantually cover all the transports (and non-transport
> > > > like the "--local" optimization) so that the object-format and other
> > > > choices are communicated from the origin to a new clone anyway, so
> > > > this extra complexity "until X is fixed, it behaves this way, but
> > > > otherwise the variable is read in the meantime" may be a disservice
> > > > to the end users, even though it may make it easier in the shorter
> > > > term for maintainers of programs that rely on the buggy "git clone"
> > > > that partially honored this environment variable.
> > > >
> > > > In short, I am still not convinced that the above is a good design
> > > > choice in the longer term.
> > >
> > > I also think it is working against the backwards-compatible design of
> > > the hash function transition.
> >
> > To be honest this whole approach seems to be completely flawed to me and
> > against the whole design of git in the first place.
> >
> > In a recent email Linus Torvalds explained why object ids were
> > calculated based {type, size, data} [1], and he explained very clearly
> > that two objects with exactly the same data are not supposed to have the
> > same id if the type is different.
> 
> He said:
> 
> --- quote-begin ---
> The "no aliasing" means that no two distinct pointers can point to the
> same data. So a tagged pointer of type "commit" can not point to the
> same object as a tagged pointer of type "blob". They are distinct
> pointers, even if (maybe) the commit object encoding ends up then
> being identical to a blob object.
> --- quote-end ---
> 
> As far as I could tell he didn't really explain *why* he wanted this,
> and IMO it is non-obvious why he would care if a blob and a commit had
> the same text, and thus the same ID. He just said he didnt want it to
> happen, not why.

But we don't need to understand why to know it's part of the core design.

If something is part of the core design as a rule it's better to not
mess with it.

> I can imagine some aesthetic reasons why you might want to ensure that
> no blob has the same ID as a commit, and I can imagine it might make
> debugging easier at certain points, but it seems unnecessary given the
> data is write once.

I don't know, but to me separating objects makes sense not just conceptually,
but in practice there's a whole class of potential errors that could be
avoided.

For example, I can think of an implementation of `git prune` that would check
commits first, then trees, then blobs, and the blobs that not reachable from
any trees are removed. But if a commit can have the same id as a blob, you have
to think of a different implementation.

If that's not possible, then you just forget about those potential issues.

> > If even the tiniest change such as adding a period to a commit messange
> > changes the object id (and thus semantically makes it a different
> > object), then it makes sense that changing the type of an object also
> > changes the object id (and thus it's also a different object).
> >
> > And because the id of the parent is included in the content of every
> > commit, the top-level id ensures the integrity of the whole graph.
> >
> > But then comes this notion that the hash algorithm is a property of the
> > repository, and not part of the object storage, which means changing the
> > whole hash algorithm of a repository is considered less of a change than
> > adding a period to the commit message, worse: not a change at all.
> 
> I really dont understand why you think having two hash functions
> producing different results for the same data is comparable to a
> single hash producing different results for different data.

That depends on what you consider the "data" to be.

If you consider the content of a blob to be the data, then you wouldn't
have different results if a commit has the same data: it would be the
same id.

If instead you consider the data to be `type+content`, then you would
have different results.

> In one case you have two different continuum of identifiers, with one
> ID per continuum, and in the other you have two different identifiers
> in the same continuum, and  if you a continuum you would have 4
> different identifiers right? Eg, the two cases are really quite
> different at a fundamental level.

That entirely depends on what data you hash.

If you hash `algo+type+size+data` there's only one id per object.
Period.

> > I am reminded of the warning Sam Smith gave to the Git project [2] which
> > seemed to be unheard, but the notion of cryptographic algorithm agility
> > makes complete sense to me.
> >
> > In my view one repository should be able to have part SHA-1 history,
> > part SHA3-256 history, and part BLAKE2b history.
> 
> Isn't this orthagonal to your other points?

Not if you consider changing the hash algorithm of a repository to be an
important part of its history (more important than adding a period to a
commit).

> > Changing the hash algorithm of one commit should change the object id of
> > that commit, and thus make it semantically a different commit.
> >
> > In other words: an object of type "blob" should never be confused with
> > an object of type "blob:sha-256", even if the content is exactly the
> > same.
> 
> This doesn't make sense to me.  As long as we can distinguish the
> hashes produced by the different hash functions in use we can create a
> mapping of the data that is hashed such that we have a 1:1 mapping of
> identifiers of each type at which point it really doesn't matter which
> hash function is used.

Yes, we *can*, that doesn't mean we *should*.

If you do `git commit --ammend --signoff` to add your `Signed-off-by` to
a commit, there's a 1:1 mapping from the original commit, to the new
one, but conceptually in git they are different objects.

I recall Linus Torvalds mentioned he used Monotone as a guideline of
what *not* to do. In Monotone you could add the equivalent of
`Signed-off-by` without changing the hash of the commit, in fact, you
could add any metadata if I recall correctly. But this opens a whole can
of worms because now how do you know you have all the metadata relevant
to the commit?

Making all the metadata of a commit part of the commit solves the
integrity problem Monotone had at the cost of making git commits
essentially immutable: any change means it's a different commit.

If making *any* change in the object, makes it conceptually a different
object, including the type of the object, how on Earth is changing the
hash algorithm not considered a change?

This object:

  ❯ git hash-object -t blob /dev/null
  e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

Is considered different from this object:

  ❯ git hash-object -t tree /dev/null
  4b825dc642cb6eb9a060e54bf8d69288fbee4904

That's why they have a different hash.

Why would these objects be considered the same?

  ❯ git hash-object -t blob /dev/null
  e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

  ❯ git hash-object -t blob /dev/null
  473a0f4c3be8a93681a267e3b1e9a7dcda1185436fe141f7749120a303721813

It makes *zero* sense that adding a period changes the object, adding a
s-o-b changes the object, changing the type changes the object, but
changing the hash algorithm does not.

> > The fact that apparently it's so easy to clone a repository with
> > the wrong hash algorithm should give developers pause, as it means the
> > whole point of using cryptographic hash algorithms to ensure the
> > integrity of the commit history is completely gone.
> 
> This is a leap too far. The fact that it is "so easy to clone a repo
> with the wrong hash algorithm" is completely orthogonal to the
> fundamental principles of hash identifiers from strong hash functions.

Only if you think changing the hash algorithm is a less important part
of an object than adding a period.

Do you honestly think these two should be considered the same object?

a)

  tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
  author Felipe Contreras <felipe.contreras@gmail.com> 0 -0600
  committer Felipe Contreras <felipe.contreras@gmail.com> 0 -0600

  Initial commit

b)

  tree 6ef19b41225c5369f1c104d45d8d85efa9b057b53b14b4b9b939dd74decc5321
  author Felipe Contreras <felipe.contreras@gmail.com> 0 -0600
  committer Felipe Contreras <felipe.contreras@gmail.com> 0 -0600

  Initial commit

> You seem to be deriving grand conclusions from what sounds to me like
> a simple bug/design-oversight.

I think you are dismissing the brilliant idea that made git's object
storage model so successful.

> > I have not been following the SHA-1 -> OID discussions, but I
> > distinctively recall Linus Torvalds mentioning that the choice of using
> > SHA-1 wasn't even for security purposes, it was to ensure integrity.
> > When I do a `git fetch` as long as the new commits have the same SHA-1
> > as parent as the SHA-1s I have in my repository I can be relatively
> > certain the repository has not been tampered with. Which means that if I
> > do a `git fetch` that suddenly brings SHA-256 commits, some of them must
> > have SHA-1 parents that match the ones I currently have. Otherwise how
> > do I know it's the same history?
> 
> So consider what /could/ happen here. You fetch a commit which uses
> SHA-256 into a repo where all of your local commits use SHA-1. The
> commit you fetched says its parent is some SHA-256 ID you don't know
> about as all your ID's are SHA-1. So git then could go and construct
> an index, hashing each item using SHA-256 instead of SHA-1, and using
> the result to build a bi-directional mapping from SHA-1 to SHA-256 and
> back.  All it has to do then is look into the mapping to find if the
> SHA-256 parent id is present in your repo. If it is then you know it's
> the same history.

Yeah, that *could* happen. Doesn't mean it *should*.

> The key point here is that if you ignore SHAttered artifacts (which
> seems reasonable as you can detect the attack during hashing)  you can
> build a 1:1 map of SHA-1 and SHA-256 ids.  Once you have that mapping
> it doesn't matter which ID is used.

It may not matter to you. It matters to me.

> > Maybe that's one of the reasons people don't seem particularly eager to
> > move away from SHA-1:
> 
> Maybe, but it doesn't make sense to me.  You seem to be putting undue
> weight on an unnecessary aspect of the git design: there doesn't seem
> to be a reason for Linuses "no aliasing" policy, and it seems like one
> could build a git-a-like without it without suffering any significant
> penalties.

The fact you don't see a reason doesn't mean there isn't one. This is an
argument from ignorance fallacy.

The appendix was considered a vestigial organ because nobody could see a
reason for it. Did that mean it served no puprpose? No.

> Regardless, provided that the hash functions allow a 1:1 mapping of
> ID's (which is assumed by using "collision free hash functions"), it
> seems like it really doesn't matter which hash is used at any given
> time.

People who designed CVS, Subversion, Monotone, and Mercurial didn't see
a reason for many of Git's design choices either.

I'd argue they were wrong.

I think changing the hash algorithm of a commit matters.

Cheers.

-- 
Felipe Contreras

next prev parent reply	other threads:[~2023-05-03 18:20 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-05 10:28 git clone of empty repositories doesn't preserve hash Adam Majer
2023-04-05 19:04 ` Junio C Hamano
2023-04-05 19:47   ` Adam Majer
2023-04-05 20:01     ` Jeff King
2023-04-05 20:40       ` Junio C Hamano
2023-04-05 21:15         ` Junio C Hamano
2023-04-05 21:26           ` Jeff King
2023-04-05 22:48           ` brian m. carlson
2023-04-06 13:11           ` Adam Majer
2023-04-25 21:35           ` brian m. carlson
2023-04-25 22:24             ` Junio C Hamano
2023-04-25 23:12             ` Junio C Hamano
2023-04-26  0:20               ` brian m. carlson
2023-04-26 11:25                 ` Jeff King
2023-04-26 15:08                   ` Junio C Hamano
2023-04-26 15:13                     ` [PATCH] doc: GIT_DEFAULT_HASH is and will be ignored during "clone" Junio C Hamano
2023-04-26 21:06                       ` brian m. carlson
2023-04-27  4:46                     ` git clone of empty repositories doesn't preserve hash Jeff King
2023-04-26 10:51               ` Jeff King
2023-04-26 15:42                 ` Junio C Hamano
2023-04-26 20:40                 ` brian m. carlson
2023-04-26 20:53                   ` [PATCH 0/2] Fix empty SHA-256 clones with v0 and v1 brian m. carlson
2023-04-26 20:53                     ` [PATCH 1/2] http: advertise capabilities when cloning empty repos brian m. carlson
2023-04-26 21:14                       ` Junio C Hamano
2023-04-26 21:28                         ` brian m. carlson
2023-04-27  5:00                           ` Jeff King
2023-04-27  5:30                       ` Jeff King
2023-04-27 20:40                         ` Junio C Hamano
2023-04-26 20:53                     ` [PATCH 2/2] Honor GIT_DEFAULT_HASH for empty clones without remote algo brian m. carlson
2023-04-26 21:18                       ` Junio C Hamano
2023-04-26 21:33                       ` Junio C Hamano
2023-04-27  5:43                         ` Jeff King
2023-05-02 23:46                           ` Is GIT_DEFAULT_HASH flawed? Felipe Contreras
2023-05-03  9:03                             ` Adam Majer
2023-05-03 15:44                               ` Felipe Contreras
2023-05-03 17:21                                 ` Adam Majer
2023-05-08  0:34                                   ` Felipe Contreras
2023-05-03  9:09                             ` demerphq
2023-05-03 18:20                               ` Felipe Contreras [this message]
2023-05-03 22:54                             ` brian m. carlson
2023-05-08  2:00                               ` Felipe Contreras
2023-05-08 21:38                                 ` brian m. carlson
2023-05-09 10:32                                   ` Oswald Buddenhagen
2023-05-09 16:47                                     ` Junio C Hamano
2023-04-26 21:12                     ` [PATCH 0/2] Fix empty SHA-256 clones with v0 and v1 Junio C Hamano
2023-04-27  4:56                   ` git clone of empty repositories doesn't preserve hash Jeff King
2023-05-01 17:00                   ` [PATCH v2 0/1] Fix empty SHA-256 clones with v0 and v1 brian m. carlson
2023-05-01 17:00                     ` [PATCH v2 1/1] upload-pack: advertise capabilities when cloning empty repos brian m. carlson
2023-05-01 22:40                       ` Jeff King
2023-05-01 22:51                         ` Junio C Hamano
2023-05-01 17:37                     ` [PATCH v2 0/1] Fix empty SHA-256 clones with v0 and v1 Junio C Hamano
2023-05-17 19:24                   ` [PATCH v3 " brian m. carlson
2023-05-17 19:24                     ` [PATCH v3 1/1] upload-pack: advertise capabilities when cloning empty repos brian m. carlson
2023-05-17 21:48                     ` [PATCH v3 0/1] Fix empty SHA-256 clones with v0 and v1 Junio C Hamano
2023-05-17 22:28                       ` brian m. carlson
2023-05-18 18:28                     ` Jeff King
2023-05-19 15:32                       ` brian m. carlson
2023-04-05 21:23         ` git clone of empty repositories doesn't preserve hash Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6452a5fdf2bd9_6822945f@chronos.notmuch \
    --to=felipe.contreras@gmail.com \
    --cc=adamm@zombino.com \
    --cc=demerphq@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.