From: "Brian O'Mahoney" <omb@khandalf.com>
To: unlisted-recipients:; (no To-header on input)
Cc: Git Mailing List <git@vger.kernel.org>
Subject: The git repo format
Date: Wed, 27 Apr 2005 21:47:04 +0200
Message-ID: <426FEC38.1060507@khandalf.com>
In-Reply-To: <Pine.LNX.4.58.0504271154470.18901@ppc970.osdl.org>

In understanding how to work with 'git' I had a number of initial
difficulties which are mostly covered by the e-mail from Linus below.

Most of these are already covered in the README:

For objects, i.e. blob, commit, tag and tree: inflate (zlib) the file, then read

<type>\s<size>\0<data>

where \s is a single space and <data> is in the form described by Linus below.

When you look at them closely, all the formats are simple,
unambiguous, and very easy to parse.
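
As a rough illustration of the object layout above, here is a small Python
sketch (not git's own C code). The function name parse_object is made up
for this example, and the size check assumes <size> is the decimal length
of <data>.

```python
import zlib

def parse_object(raw: bytes) -> tuple[str, bytes]:
    """Inflate a stored object and split it into (type, data)."""
    inflated = zlib.decompress(raw)
    # The header ends at the first NUL byte; everything after it is <data>.
    header, nul, data = inflated.partition(b"\0")
    if not nul:
        raise ValueError("missing NUL after header")
    # The header is "<type> <size>" with exactly one space between them.
    obj_type, space, size_text = header.partition(b" ")
    if not space or not size_text.isdigit():
        raise ValueError("malformed header")
    # Assumption: <size> is the decimal byte length of <data>.
    if int(size_text) != len(data):
        raise ValueError("size does not match data length")
    return obj_type.decode("ascii"), data
```

Feeding it zlib.compress(b"blob 6\0hello\n") should hand back the type
"blob" and the six data bytes.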

The index is also easy to parse, but there is one detail:
after the 3-int header, each record is padded to a multiple
of 8 bytes. See cache.h for the specifics.
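
To make the padding rule concrete, here is a hedged Python sketch. The
62-byte fixed part is my assumption about the record layout in cache.h
(ten 32-bit fields, a 20-byte sha1 and a 16-bit name length), and the
"add 8, then mask" rounding, which always leaves at least one NUL after
the name, is likewise an assumption. The function name is made up; check
cache.h before relying on any of it.

```python
def padded_entry_size(name_len: int, fixed_len: int = 62) -> int:
    """Size of one index record, rounded to a multiple of 8 bytes.

    fixed_len=62 is an assumed size for the fields before the name.
    Adding 8 and clearing the low three bits rounds up while keeping
    room for at least one terminating NUL byte.
    """
    return (fixed_len + name_len + 8) & ~7

# Record sizes for a few hypothetical path lengths.
sizes = {n: padded_entry_size(n) for n in (1, 2, 9, 10)}
```

A parser would read the fixed part, read the name, then skip ahead to the
padded size to find the start of the next record.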

Maybe the README needs to reinforce this.

Brian

> I repeat: git does not do any free-form parsing AT ALL.
========================================================

 The links are in well-defined places, and you do not ever search for them.

And that's really very very important.


> For a "commit", the format is
> 
>  - first line is exactly 46 bytes: five bytes of "tree ", 40 bytes of hex 
>    sha1, and one byte of "\n".
> 
>    NOTHING ELSE. Not extra spaces at the end, not extra spaces at the 
>    beginning or the middle. It's ASCII, but it's not free-format ASCII.
> 
>  - the next <n> (where 'n' can be 0 or more) lines are _exactly_ 48 bytes
>    each:  seven bytes of "parent ", 40 bytes of hex sha1, and one byte of 
>    "\n".
> 
>    NOTHING ELSE.
> 
>  - the next lines are "author " and "committer ". They have well-defined
>    delimiters for their fields, and no sha1's. The fields cannot contain
>    '<', '>' or newlines, since those are the field/line delimiters.
> 
> There is no free-format text _anywhere_ that git parses. No room for 
> guesses, no room for mistakes, no room for anything half-way questionable.
> 
> And fsck actually enforces this. We do _not_ just use "gets()" to read one
> line at a time. We literally verify that the lines are 46/48 bytes long,
> and have the delimiters in the expected places.
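
The fixed-offset check described above might look roughly like this in
Python. This is only a sketch, not fsck's actual code: the function name
is invented, and requiring lowercase hex is my assumption.

```python
import re

# Assumption: sha1s are written as 40 lowercase hex digits.
HEX40 = re.compile(rb"[0-9a-f]{40}")

def parse_commit_links(body: bytes) -> tuple[bytes, list[bytes]]:
    """Return (tree_sha1, parent_sha1s) from a commit body, or raise."""
    # "tree " (5 bytes) + 40 hex bytes + "\n" = exactly 46 bytes.
    if body[:5] != b"tree " or body[45:46] != b"\n":
        raise ValueError("bad tree line")
    tree = body[5:45]
    if not HEX40.fullmatch(tree):
        raise ValueError("bad tree sha1")
    pos = 46
    parents = []
    # Zero or more lines of "parent " (7) + 40 hex + "\n" = 48 bytes each.
    while body[pos:pos + 7] == b"parent ":
        sha1 = body[pos + 7:pos + 47]
        if not HEX40.fullmatch(sha1) or body[pos + 47:pos + 48] != b"\n":
            raise ValueError("bad parent line")
        parents.append(sha1)
        pos += 48
    # The links must be followed directly by the author line.
    if body[pos:pos + 7] != b"author ":
        raise ValueError("expected author line")
    return tree, parents
```

Nothing is searched for: every byte is either at the offset the format
dictates or the commit is rejected.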
> 
> Same goes for "tree" and "tag" objects. They all have fixed-format stuff. 
> A "tree" entry is always
> 
> 	"%o <space> %s" \0 [ 20 bytes of sha1 ]
> 
> with "%o" being "mode", and "%s" being "path". We don't guess. 
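
A tree body in that layout can be walked with a few lines of Python. This
is a sketch under the format as stated above; parse_tree is a made-up
name, and treating the mode as strictly octal digits is my reading of
"%o".

```python
def parse_tree(data: bytes) -> list[tuple[int, bytes, bytes]]:
    """Split an inflated tree body into (mode, path, sha1) entries."""
    entries = []
    pos = 0
    while pos < len(data):
        # The mode runs up to the first space, the path up to the NUL.
        space = data.index(b" ", pos)
        nul = data.index(b"\0", space + 1)
        mode_text = data[pos:space]
        if not mode_text or any(c not in b"01234567" for c in mode_text):
            raise ValueError("mode is not octal")
        path = data[space + 1:nul]
        # The sha1 is 20 raw bytes, not 40 hex characters.
        sha1 = data[nul + 1:nul + 21]
        if len(sha1) != 20:
            raise ValueError("truncated sha1")
        entries.append((int(mode_text, 8), path, sha1))
        pos = nul + 21
    return entries
```

Because the sha1 has a fixed width, the parser always knows where the
next entry starts without scanning for a terminator.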
> 
> And this really is _important_. Exactly because we name things by the SHA1
> hash of the contents, we MUST NOT have flexible formats. Having a format
> which allows non-canonical representations (extra spaces etc) would mean
> that two trees that were identical would depend on how you happened to
> format them.
> 
> So there's really two issues:
>  - we don't guess or parse contents. We have strict rules, and that makes 
>    git more reliable. There are no gray areas. There's "right" and there 
>    is "wrong", and the right one works, and the wrong one gets flagged as 
>    being wrong and the tools refuse to touch it.
>  - there is only _one_ right way to do things, and that means that the
>    content is well-defined, and thus the SHA1 of the content is
>    well-defined.
> 
> For example, another rule is that a "tree" object is always sorted by 
> the bytes in the filename (not by entry, btw: a directory called "foo" 
> will sort as "foo/", even though the _entry_ only shows "foo"). That rule 
> not only makes a lot of operations faster, but again, it means that there 
> is only _one_ way to represent a tree validly.
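
The "foo sorts as foo/" rule can be sketched as a sort key in Python. The
names here are hypothetical, and using the mode's directory bit to decide
when to append "/" is my assumption about how one would detect a
directory entry.

```python
import stat

def tree_sort_key(entry: tuple[int, bytes]) -> bytes:
    """Sort key for (mode, path): directories compare as path + "/"."""
    mode, path = entry
    return path + b"/" if stat.S_ISDIR(mode) else path

entries = [
    (0o40000, b"foo"),       # a directory
    (0o100644, b"foo.c"),    # regular files
    (0o100644, b"foo-bar"),
]
ordered = sorted(entries, key=tree_sort_key)
# Byte order is "-" (0x2d) < "." (0x2e) < "/" (0x2f), so the directory
# "foo" lands last here: foo-bar, foo.c, foo.
```

Sorting on the bare entry names would instead put "foo" first, which is
exactly the difference the rule is about.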
> 
> IOW, you _cannot_ represent a tree any other way (and I've been too lazy
> to check this in fsck, but it's always been my plan), and that is exactly
> why we can just compare the hashes of the results - because there is no 
> random component of "layout" in the contents.
> 
> This really is important. It means that if you get to the same two tree
> contents in totally unrelated ways (you unpack a tar-file and encode it in
> git, or you have 5 years of git history and check it out), the "tree" will
> match _exactly_. There's no history. There's no "optional" stuff. Since
> the contents of the trees are the same, the SHA1 of the two trees will be
> the same. Exactly because git refuses to touch any free-format stuff.
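
To see why a canonical form matters for naming by hash, here is a minimal
Python sketch. It assumes the name is the SHA1 of the inflated
"<type> <size>\0<data>" bytes, which is how current git computes it; the
function name object_id is invented for the example.

```python
import hashlib

def object_id(obj_type: str, data: bytes) -> str:
    """Hex SHA1 over "<type> <size>\\0" followed by the data."""
    # Assumption: the hash covers the uncompressed header plus data.
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()

# One tree entry: mode, space, path, NUL, 20 raw sha1 bytes.
entry = b"100644 README\0" + bytes(20)
same_a = object_id("tree", entry)
same_b = object_id("tree", bytes(entry))      # built separately, same bytes
sloppy = object_id("tree", b"100644  README\0" + bytes(20))  # extra space
```

Here same_a and same_b are equal because the bytes are equal, while
sloppy differs although a human would call it "the same tree". A flexible
format would let that happen; a strict one cannot.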

