* [PATCH/RFC] Document format of basic Git objects
@ 2012-02-15 13:22 Nguyễn Thái Ngọc Duy
2012-02-15 17:31 ` Jonathan Nieder
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-02-15 13:22 UTC (permalink / raw)
To: git; +Cc: Nguyễn Thái Ngọc Duy
Basic objects' format is pretty simple and (I think) well-known.
However it's good that we document them. At least we can keep track of
the evolution of an object format. The commit object, for example,
over the years has learned "encoding" and recently GPG signing.
This is just a draft text with a bunch of fixmes. But I'd like to hear
from the community if this is a worthy effort. If so, then whether
git-cat-file is a proper place for it. Or maybe we put relevant text
in commit-tree, write-tree and mktag, then refer to them in cat-file
because cat-file can show raw objects.
So comments?
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
PS. This also makes me wonder if tag object supports "encoding".
Haven't dug down in history yet.
Documentation/git-cat-file.txt | 40 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 40 insertions(+), 0 deletions(-)
diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
index 2fb95bb..e3dd6d9 100644
--- a/Documentation/git-cat-file.txt
+++ b/Documentation/git-cat-file.txt
@@ -100,6 +100,46 @@ for each object specified on stdin that does not exist in the repository:
<object> SP missing LF
------------
+OBJECT FORMAT
+-------------
+
+Tree object consists of a series of tree entries sorted in memcmp()
+order by entry name. Each entry consists of:
+
+- POSIX file mode encoded in octal ascii
+- One space character
+- Entry name terminated by one character NUL
+- 20 byte SHA-1 of the entry
+
+Tag object is ascii plain text in a format similar to email format
+(RFC 822). It consists of a header and a body, separated by a blank
+line. The header includes exactly four fields in the following order:
+
+1. "object" field, followed by SHA-1 in ascii of the tagged object
+2. "type" field, followed by the type in ascii of the tagged object
+ (either "commit", "tag", "blob" or "tree" without quotes,
+ case-sensitive)
+3. "tag" field, followed by the tag name
+4. "tagger" field, followed by the <XXX, to be named>
+
+The tag body contains the tag's message and possibly GPG signature.
+
+Commit object is in similar format to tag object. The commit body is
+in plain text of the chosen encoding (by default UTF-8). The commit
+header has the following fields in listed order
+
+1. One "tree" field, followed by the commit's tree's SHA-1 in ascii
+2. Zero, one or more "parent" field
+3. One "author" field, in <XXX to be named> format
+3. One "committer" field, in <XXX to be named> format
+4. Optionally one "encoding" field, followed by the encoding used for
+ commit body
+5. GPG signature (fixme)
+
+More headers after these fields are allowed. Unrecognized header
+fields must be kept untouched if the commit is rewritten. However, a
+compliant Git implementation produces the above header fields only.
+
GIT
---
Part of the linkgit:git[1] suite
--
1.7.8.36.g69ee2
* Re: [PATCH/RFC] Document format of basic Git objects
2012-02-15 13:22 [PATCH/RFC] Document format of basic Git objects Nguyễn Thái Ngọc Duy
@ 2012-02-15 17:31 ` Jonathan Nieder
2012-02-15 19:48 ` Junio C Hamano
2012-02-19 4:15 ` [PATCH/RFC v2] " Nguyễn Thái Ngọc Duy
2 siblings, 0 replies; 17+ messages in thread
From: Jonathan Nieder @ 2012-02-15 17:31 UTC (permalink / raw)
To: Nguyễn Thái Ngọc Duy; +Cc: git, Shawn O. Pearce, Scott Chacon
Hi,
Nguyễn Thái Ngọc Duy wrote:
> Basic objects' format is pretty simple and (I think) well-known.
> However it's good that we document them. At least we can keep track of
> the evolution of an object format. The commit object, for example,
> over the years has learned "encoding" and recently GPG signing.
Yes, I agree.
> This is just a draft text with a bunch of fixmes. But I'd like to hear
> from the community if this is a worthy effort. If so, then whether
> git-cat-file is a proper place for it. Or maybe we put relevant text
> in commit-tree, write-tree and mktag, then refer to them in cat-file
> because cat-file can show raw objects.
About where to place this text, I am of two minds.
1. On one hand, from the user's perspective it would be most intuitive
to place it in a separate git-object(5) manual page. That way,
gitrepository-layout(5), git-fsck(1), git-hash-object(1), the user
manual, and so on would all have one document to link to.
2. On the other hand, from a development perspective I suspect it
would be valuable to put it in the git-fsck(1) page, since that would
have two consequences:
- when changing the documentation, this would provide a reminder to
update fsck.c at the same time
- when changing fsck.c, this would provide a reminder to update the
documentation at the same time
Ok, (2) was tongue in cheek. :) I believe this information belongs in
a dedicated page with a name like gitobject(5), and that you are right
to put it in user-visible documentation instead of hiding it in
Documentation/technical, since it is information needed if one is to
use "git hash-object -w" correctly.
Ok, on to the text itself.
[...]
> --- a/Documentation/git-cat-file.txt
> +++ b/Documentation/git-cat-file.txt
> @@ -100,6 +100,46 @@ for each object specified on stdin that does not exist in the repository:
> <object> SP missing LF
> ------------
>
> +OBJECT FORMAT
> +-------------
> +
> +Tree object consists of a series of tree entries sorted in memcmp()
> +order by entry name.
Missing article ("A tree object", "The tree object", or "Each tree
object"). More importantly, the curious reader might want to know
whether a tree object is supposed to contain entries pointing to other
tree objects for subdirectories or whether the subdirectory's
information is included inline like in the index.
I guess I would expect something like (stealing from the user manual):
TREE OBJECTS
------------
A tree object contains a list of entries, each with a mode,
object type, object name, and filename, sorted by filename. It
represents the contents of a single directory tree.
The object type may be a blob, representing the contents of a
file, another tree, representing the contents of a subdirectory,
or a commit (representing a subproject). Since trees and blobs,
like all other objects, are named by a hash of their contents,
two trees have the same object name if and only if their
contents (including, recursively, the contents of all
subdirectories) are identical. This allows git to quickly
determine the differences between two related tree objects,
since it can ignore any entries with identical object names.
Note that the files all have mode 644 or 755: git actually only
pays attention to the executable bit.
Encoding
~~~~~~~~
Entries are of variable length and self-delimiting. Each entry
consists of
- a POSIX file mode in octal representation
- exactly one space (ASCII SP)
- filename for the entry, as a NUL-terminated string
- 20-byte binary object name
The mode should be 100755 (executable file), 100644 (regular
file), 120000 (symlink), 40000 (subdirectory), or 160000
(subproject), with no leading zeroes. Modes with one leading
zero and the synonym 100664 for 100644 are also accepted for
historical reasons.
The filename may be an arbitrary nonempty string of bytes, as
long as it contains no '/' or NUL character.
The associated object must be a valid blob if the mode indicates
a file or symlink, tree if it indicates a subdirectory, or
commit if it indicates a subproject. The blob associated with a
symlink entry indicates the link target, and its content must not
have any embedded NULs.
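As a rough illustration of the entry encoding described above, here is a minimal Python sketch that splits a raw tree payload into entries. The function name `parse_tree` and the sample payload are made up for this example, and the sketch does not validate modes, names, or sort order.

```python
def parse_tree(payload: bytes):
    """Split a raw tree payload into (mode, name, object_name_hex) tuples."""
    entries = []
    pos = 0
    while pos < len(payload):
        sp = payload.index(b" ", pos)        # the mode ends at the first space
        nul = payload.index(b"\0", sp + 1)   # the filename ends at the NUL
        mode = payload[pos:sp].decode("ascii")
        name = payload[sp + 1:nul]           # arbitrary bytes, kept as-is
        sha = payload[nul + 1:nul + 21]      # 20-byte binary object name
        if len(sha) != 20:
            raise ValueError("truncated tree entry")
        entries.append((mode, name, sha.hex()))
        pos = nul + 21                       # entries are self-delimiting
    return entries


# Hypothetical payload: one regular file and one subdirectory.
sample = (b"100644 README\0" + bytes(range(20)) +
          b"40000 src\0" + bytes(20))
for mode, name, sha in parse_tree(sample):
    print(mode, name, sha)
```

The object name is skipped by length rather than searched for, because the 20 binary bytes may themselves contain SP or NUL.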
By the way, git fsck seems to tolerate the old "flat tree" format
(i.e., that condition is FSCK_WARN and not FSCK_ERROR), but I don't
see any code supporting it elsewhere in git. Bug?
Sorting
~~~~~~~
... no duplicates, sort order, etc ...
[...]
> +Tag object is ascii plain text in a format similar to email format
> +(RFC 822). It consists of a header and a body, separated by a blank
> +line.
The above description makes me worry that the reader might try some
things that are allowed by RFC 822: rearranging header fields,
continuation lines, and so on.
> The header includes exactly four fields in the following order:
> +
> +1. "object" field, followed by SHA-1 in ascii of the tagged object
> +2. "type" field, followed by the type in ascii of the tagged object
> + (either "commit", "tag", "blob" or "tree" without quotes,
> + case-sensitive)
> +3. "tag" field, followed by the tag name
> +4. "tagger" field, followed by the <XXX, to be named>
> +
> +The tag body contains the tag's message and possibly GPG signature.
This part looks good. Stealing from the user manual again, maybe:
TAG OBJECTS
-----------
A tag object contains an object, object type, tag name, the name
of the person ("tagger") who created the tag, and a message,
which may contain a signature.
------------------------------------------------
$ git cat-file tag v1.5.0
object 437b1b20df4b356c9342dac8d38849f24ef44f27
type commit
tag v1.5.0
tagger Junio C Hamano <junkio@cox.net> 1171411200 +0000
GIT 1.5.0
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
iD8DBQBF0lGqwMbZpPMRm5oRAuRiAJ9ohBLd7s2kqjkKlq1qqC57SbnmzQCdG4ui
nLE/L9aUXdWeTFPron96DLA=
=2E+0
-----END PGP SIGNATURE-----
------------------------------------------------
More precisely, a tag contains at least five lines:
1. "object", followed by a space, followed by the 40-character
textual object name of the tagged object
2. "type" + SP + the type of the tagged object ("commit", "tag",
"blob", or "tree")
3. "tag" + SP + the name of the tag
4. "tagger" + SP + an ident string
5. a blank line
Any remaining text after these lines forms the tag message.
The object field must point to a valid object of type indicated
by the type field. The tag name can be an arbitrary string
without NUL bytes or embedded newlines; in practice it usually
follows the restrictions described in git-check-ref-format(1).
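To make the five-line layout concrete, a small Python sketch that checks the four leading fields and returns the message. The name `parse_tag` is hypothetical, the sample is the v1.5.0 tag above with the signature left out, and the sketch assumes no extra header lines follow the "tagger" line.

```python
def parse_tag(payload: bytes):
    """Split a raw tag payload into its four leading fields and the message."""
    header, sep, message = payload.partition(b"\n\n")
    if not sep:
        raise ValueError("missing blank line after the header")
    lines = header.split(b"\n")
    expected = [b"object", b"type", b"tag", b"tagger"]
    if len(lines) < len(expected):
        raise ValueError("header has fewer than four lines")
    fields = {}
    for want, line in zip(expected, lines):
        name, sp, value = line.partition(b" ")
        if name != want or not sp:
            raise ValueError("expected a %s line" % want.decode("ascii"))
        fields[want.decode("ascii")] = value
    if len(fields["object"]) != 40:
        raise ValueError("object name is not 40 characters")
    if fields["type"] not in (b"commit", b"tag", b"blob", b"tree"):
        raise ValueError("unknown object type")
    return fields, message


sample = (b"object 437b1b20df4b356c9342dac8d38849f24ef44f27\n"
          b"type commit\n"
          b"tag v1.5.0\n"
          b"tagger Junio C Hamano <junkio@cox.net> 1171411200 +0000\n"
          b"\n"
          b"GIT 1.5.0\n")
fields, message = parse_tag(sample)
print(fields["tag"], message)
```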
[...]
> +
> +Commit object is in similar format to tag object. The commit body is
> +in plain text of the chosen encoding (by default UTF-8). The commit
> +header has the following fields in listed order
Same considerations apply here --- I'd suggest stealing text from the
commit-object section of the user manual and from commit logs.
Hope that helps,
Jonathan
> +
> +1. One "tree" field, followed by the commit's tree's SHA-1 in ascii
> +2. Zero, one or more "parent" field
> +3. One "author" field, in <XXX to be named> format
> +3. One "committer" field, in <XXX to be named> format
> +4. Optionally one "encoding" field, followed by the encoding used for
> + commit body
> +5. GPG signature (fixme)
> +
> +More headers after these fields are allowed. Unrecognized header
> +fields must be kept untouched if the commit is rewritten. However, a
> +compliant Git implementation produces the above header fields only.
> +
> GIT
> ---
> Part of the linkgit:git[1] suite
* Re: [PATCH/RFC] Document format of basic Git objects
2012-02-15 13:22 [PATCH/RFC] Document format of basic Git objects Nguyễn Thái Ngọc Duy
2012-02-15 17:31 ` Jonathan Nieder
@ 2012-02-15 19:48 ` Junio C Hamano
2012-02-16 7:12 ` Junio C Hamano
2012-02-19 4:15 ` [PATCH/RFC v2] " Nguyễn Thái Ngọc Duy
2 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2012-02-15 19:48 UTC (permalink / raw)
To: Nguyễn Thái Ngọc Duy; +Cc: git
Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes:
> This is just a draft text with a bunch of fixmes. But I'd like to hear
> from the community if this is a worthy effort. If so, then whether
> git-cat-file is a proper place for it. Or maybe we put relevant text
> in commit-tree, write-tree and mktag, then refer to them in cat-file
> because cat-file can show raw objects.
>
> So comments?
This _only_ describes the payload (i.e. without the 'blob <size>\n' header
used in loose object, in other words, what read_object() may return).
There should be a sentence to stress this. As many Git intros (including
my book) begin with the "a short header 'blob <size>\n' concatenated with
the contents is hashed to compute the object name" picture, it would be
confusing unless you explicitly say that you are only describing the
"contents" part.
It makes sense to mention that the cat-file subcommand is used to obtain
this raw data somewhere in the documentation, but I would say the content
of this patch belongs to Documentation/technical/ somewhere.
> PS. This also makes me wonder if tag object supports "encoding".
I do not think so.
> +OBJECT FORMAT
> +-------------
> +
> +Tree object consists of a series of tree entries sorted in memcmp()
> +order by entry name. Each entry consists of:
> +
> +- POSIX file mode encoded in octal ascii
Add ", no 0 padding to the right" at the end, as I heard that every
imitation of Git gets this wrong in its first version.
> +- One space character
> +- Entry name terminated by one character NUL
> +- 20 byte SHA-1 of the entry
> +Tag object is ascii plain text in a format similar to email format
> +(RFC 822). ...
Do not mention "email format (RFC 822)" at all. The differences are
significant enough that it only confuses the readers.
We do not have colon at the end of the header, we do not promise to parse
field names case insensitively, and the way continuation lines are parsed
is totally different (a "similar" construct in RFC 2822 is "folded header
lines", but it is signalled by "folding white space", it discards the
end-of-line from the previous line and makes the result a logical single
line. Our continuation lines are introduced by a single SP and the result
of concatenation keeps the end-of-line from the previous lines, making the
result multiple lines).
Also we do not promise that the lines in the header part are always
<field,value> pairs. So rephrase this while carefully distinguishing
between "a line in header" and "field".
A commit or a tag object begins with the "header" that consists of one
or more lines delimited by LF. The end of the header is signalled by
an empty line.
A "continuation line" in the header begins with a SP. The remainder
of the line, after removing that SP, is concatenated to the previous
line, while retaining the LF at the end of the previous line.
When a line in the header begins with a letter other than SP, and has
at least one SP in it, it is called a "field". A field consists of
the "field name", which is the string before the first SP on the line,
and its "value", which is everything after that SP. When the value
consists of multiple lines, continuation lines are used.
More than one field with the same name can appear in the header of an
object, and the order in which they appear is significant.
In a commit object, the header begins with the following fields that
have such and such meaning.
In a tag object, the header begins with the following fields...
After these defined fields, newer versions of git may add more lines
in the header. Some of them may be fields, others might not be.
Implementations that parse commit and tag objects must ignore lines in
the header that they do not understand without triggering an error.
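The rules above might be sketched in Python roughly as follows. This is only an illustration of the wording, not code taken from git; the name `parse_header` and the sample object (including the made-up "futureline" header line) are assumptions for the example.

```python
def parse_header(payload: bytes):
    """Return (fields, body) for a raw commit or tag payload.

    fields is a list of (name, value) pairs in header order, so repeated
    names and their relative order are preserved.
    """
    header, _, body = payload.partition(b"\n\n")
    logical = []
    for raw in header.split(b"\n"):
        if raw.startswith(b" ") and logical:
            # Continuation line: drop the leading SP and keep the LF
            # that ended the previous line.
            logical[-1] += b"\n" + raw[1:]
        else:
            logical.append(raw)
    fields = []
    for line in logical:
        name, sp, value = line.partition(b" ")
        if sp:
            fields.append((name, value))
        # A header line without any SP is not a field; skip it silently.
    return fields, body


sample = b"\n".join([
    b"tree " + b"1" * 40,
    b"author A U Thor <author@example.com> 1326576355 -0800",
    b"committer A U Thor <author@example.com> 1326576355 -0800",
    b"mergetag object " + b"2" * 40,
    b" type commit",
    b" ",                      # an empty line inside a multi-line value
    b" example tag message",
    b"futureline",             # hypothetical non-field line, ignored
    b"",
    b"Subject line",
    b"",
])
fields, body = parse_header(sample)
print([name for name, _ in fields], body)
```

An empty line inside a multi-line value shows up in the header as a line holding a single SP, so the first truly empty line still marks the end of the header.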
> ... It consists of a header and a body, separated by a blank
> +line. The header includes exactly four fields in the following order:
> +
If you hand-craft a tag-like object that has an unknown field after these
four, how badly do the current implementations behave?
> +1. "object" field, followed by SHA-1 in ascii of the tagged object
> +2. "type" field, followed by the type in ascii of the tagged object
> + (either "commit", "tag", "blob" or "tree" without quotes,
> + case-sensitive)
> +3. "tag" field, followed by the tag name
> +4. "tagger" field, followed by the <XXX, to be named>
> +The tag body contains the tag's message and possibly GPG signature.
> +
> +Commit object is in similar format to tag object. The commit body is
It is strange that you introduce tag and then commit. I would think that
readers expect to see them presented in the usual blob/tree/commit/tag
order.
> +in plain text of the chosen encoding (by default UTF-8). The commit
> +header has the following fields in listed order
> +
> +1. One "tree" field, followed by the commit's tree's SHA-1 in ascii
> +2. Zero, one or more "parent" field
> +3. One "author" field, in <XXX to be named> format
> +3. One "committer" field, in <XXX to be named> format
> +4. Optionally one "encoding" field, followed by the encoding used for
> + commit body
> +5. GPG signature (fixme)
> +
> +More headers after these fields are allowed. Unrecognized header
> +fields must be kept untouched if the commit is rewritten.
Replace the first sentence with "New kinds of fields may be added in later
versions of git." and drop the second one entirely. Depending on the
reason and nature of the "rewrite", we may or may not want to keep these
unknown header lines, so it is best to leave the behaviour unspecified.
For example, it makes sense to retain "mergetag" because it is about the
parent, not the resulting commit. It does not make sense to keep "gpgsig"
because it is about the commit you are rewriting to invalidate that old
signature.
* Re: [PATCH/RFC] Document format of basic Git objects
2012-02-15 19:48 ` Junio C Hamano
@ 2012-02-16 7:12 ` Junio C Hamano
0 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2012-02-16 7:12 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nguyễn Thái Ngọc Duy, git
Junio C Hamano <gitster@pobox.com> writes:
>> +- POSIX file mode encoded in octal ascii
>
> Add ", no 0 padding to the right" at the end, as I heard that every
> imitation of Git gets this wrong in its first version.
Ehh, of course no 0 padding on the LEFT hand side.
Rice-bowl with left hand, chopsticks with right hand. I always mix these
up. Sorry.
* [PATCH/RFC v2] Document format of basic Git objects
2012-02-15 13:22 [PATCH/RFC] Document format of basic Git objects Nguyễn Thái Ngọc Duy
2012-02-15 17:31 ` Jonathan Nieder
2012-02-15 19:48 ` Junio C Hamano
@ 2012-02-19 4:15 ` Nguyễn Thái Ngọc Duy
2012-02-19 8:39 ` Junio C Hamano
2012-02-19 18:07 ` Manually decoding a git object Philip Oakley
2 siblings, 2 replies; 17+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-02-19 4:15 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Jonathan Niedier, Shawn O. Pearce, Scott Chacon,
Nguyễn Thái Ngọc Duy
Still draft for discussion. Of three people who participated on this
thread, two favor a man page (me and Jonathan), one technical/
(Junio), so let's put it as a man page for now.
Some notes:
- I'm tempted to include pack-format.txt because I also document
loose object format here. If it's included and
gitrepository-layout.txt links to this, we have a quite complete
documentation of what's inside $GIT_DIR (assuming rebase-apply and
such are of private use)
- Not sure if we fix the order of gpgsig and mergetag, or they can be
mixed together. Also not sure if we can have multiple gpgsig, I
haven't checked the code.
- I skipped the experimental loose object format (it's what it's
called in sha1_file.c). I think we can call it deprecated and move
on.
- Do we assume tag/commit header in utf-8 or ascii?
- We don't do any encoding on ident strings, right?
Mostly-written-by: Jonathan Nieder <jrnieder@gmail.com>
Mostly-written-by: Junio C Hamano <gitster@pobox.com>
Remaining-stolen-from: Documentation/user-manual.txt
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
Documentation/git-object.txt | 273 ++++++++++++++++++++++++++++++++++++++++++
1 files changed, 273 insertions(+), 0 deletions(-)
create mode 100644 Documentation/git-object.txt
diff --git a/Documentation/git-object.txt b/Documentation/git-object.txt
new file mode 100644
index 0000000..359af37
--- /dev/null
+++ b/Documentation/git-object.txt
@@ -0,0 +1,273 @@
+git-object(5)
+=============
+
+NAME
+----
+git-object - Git object format
+
+SYNOPSIS
+--------
+$GIT_DIR/objects/*
+
+OBJECT ON-DISK REPRESENTATIONS
+------------------------------
+Objects can be stored on disk as loose (unpacked) objects or
+in packs. Loose objects are in $GIT_DIR/objects/[0-9a-f][0-9a-f]
+directories. Packs are $GIT_DIR/objects/pack/pack-*.pack. Each pack
+has a corresponding index file to speed up pack access.
+
+Object SHA-1
+~~~~~~~~~~~~
+An object SHA-1 is calculated on its header and payload. The content
+to be consumed by SHA-1 calculation is:
+
+- Object type in ascii, either "commit", "tree", "tag" or "blob"
+ (without quotes)
+- One space (ASCII SP)
+- The payload length in ascii canonical decimal format
+- ASCII NUL
+- Object payload
+
+Loose objects
+~~~~~~~~~~~~~
+Loose objects are simply a compressed form using zlib(3) of the
+object's header and payload, as described in Object SHA-1 section
+above.
+
+Packed objects
+~~~~~~~~~~~~~~
+FIXME maybe include Documentation/pack-format.txt
+
+BLOB OBJECTS
+------------
+Blob object payload is file data.
+
+TREE OBJECTS
+------------
+Tree object payload contains a list of entries, each with a mode,
+object type, object name, and filename, sorted by filename. It
+represents the contents of a single directory tree.
+
+The object type may be a blob, representing the contents of a file,
+another tree, representing the contents of a subdirectory, or a commit
+(representing a subproject). Since trees and blobs, like all other
+objects, are named by a hash of their contents, two trees have the
+same object name if and only if their contents (including,
+recursively, the contents of all subdirectories) are identical. This
+allows git to quickly determine the differences between two related
+tree objects, since it can ignore any entries with identical object
+names.
+
+Note that the files all have mode 644 or 755: git actually only pays
+attention to the executable bit.
+
+Encoding
+~~~~~~~~
+Entries are of variable length and self-delimiting. Each entry
+consists of
+
+- a POSIX file mode in octal ascii representation, no 0 padding to the
+ left
+- exactly one space (ASCII SP)
+- filename for the entry, as a NUL-terminated string
+- 20-byte binary object name
+
+The mode should be 100755 (executable file), 100644 (regular file),
+120000 (symlink), 40000 (subdirectory), or 160000 (subproject), with
+no leading zeroes. Modes with one leading zero and the synonym 100664
+for 100644 are also accepted for historical reasons. Other modes are
+not accepted.
+
+The filename may be an arbitrary nonempty string of bytes, as long as
+it contains no '/' or NUL character.
+
+The associated object must be a valid blob if the mode indicates a
+file or symlink, tree if it indicates a subdirectory, or commit if it
+indicates a subproject. The blob associated with a symlink entry
+indicates the link target; its content must not have embedded NULs.
+
+Sorting
+~~~~~~~
+Entries are sorted by memcmp(3) on file name. Duplicate file names are
+not allowed.
+
+COMMIT OBJECT
+-------------
+The commit object links a physical state of a tree with a description
+of how we got there and why. Commit object payload contains the
+associated tree SHA-1, parent commits' SHA-1, author and committer
+information.
+
+------------------------------------------------
+$ git cat-file commit 81d48f0aee54
+tree 093f37084c133795e4ce71befa57185328737171
+parent f5e4e20faa1eee3feaa0394897bbd1aca544e809
+parent 661db794eb8179c7bea02f159bb691a2fff4a8e0
+parent 14c173eb63432ba5d0783b6c4b23a8fe0c76fb0f
+author Linus Torvalds <torvalds@linux-foundation.org> 1326576355 -0800
+committer Linus Torvalds <torvalds@linux-foundation.org> 1326576355 -0800
+mergetag object 661db794eb8179c7bea02f159bb691a2fff4a8e0
+ type commit
+ tag devicetree-for-linus
+ tagger Grant Likely <grant.likely@secretlab.ca> 1326520038 -0700
+
+ 2nd set of device tree changes for v3.3
+ -----BEGIN PGP SIGNATURE-----
+ Version: GnuPG v1.4.11 (GNU/Linux)
+
+ iQIcBAABAgAGBQJPERbzAAoJEEFnBt12D9kBmDIP/R9Vspc6yhjSAEvdp/VET2gi
+ TgAQfdp4VuYjjIt4cUPO5UQU9kw478GjTuP2blZEC9DlG1jSf/L8U+A7FHJIVVzU
+ QfjwV1Lqaqk+sQQ1bsp2ixbesKECmqU9IweOIFmn0U2ZD+xlPFIpE2iTKEqymejf
+ PVZsFlkVmhQZgudPNieyZMjQpQ9hEb6UcSfXT//nmoRRxCL/PiMHGRx3UdS3eRe7
+ FApSW0Mty/PD07QXPsDjg1GvK59Gf6R1/4Bd31+rXEz9yaxf4I4I02fL553NDVIt
+ tAPfo/4YKW1rLMWQRkAUqCaMk9v/DWxeWYbbiJNZ2R3kys9o8k26XXxvcuYnecS2
+ G8DDJpmOikbN3Gvlskh40Tn3TJb5Wlgc7o/10L/fq6FovS4Uk7yUeFMqXUYfl8TU
+ ziIlrlt9IGabXBN4JKJl3OabgkeO+Oz9DKhTQFJLY4/121LAtFVk3xd316mY+wpX
+ mI83VmWMlp3sK+OLr+UdMTCXZvSIpu3KlGKMpAssHKUKxIV20NHLFNbm94/ywXBn
+ Zb8arjcv7+WzwhSqQJj851cq4/sEYx5HB4wU5Nm5SXBwcO3ixiij6lHCoHU+NudR
+ eyPIFLfrzwnUu3yTRgUfAnkgOce+2I+vUsU4pXUR6FyK73wSmm0+4WXQfB+OBlwD
+ 2O1RjZedZCb6zzf17H2k
+ =mup8
+ -----END PGP SIGNATURE-----
+mergetag object 14c173eb63432ba5d0783b6c4b23a8fe0c76fb0f
+ type commit
+ tag spi-for-linus
+ tagger Grant Likely <grant.likely@secretlab.ca> 1326520366 -0700
+
+ SPI bug fixes for v3.3
+ -----BEGIN PGP SIGNATURE-----
+ Version: GnuPG v1.4.11 (GNU/Linux)
+
+ iQIcBAABAgAGBQJPERgyAAoJEEFnBt12D9kBRMsP/RBv6kWIb/qD7yJhrdbzJ4Tv
+ 1f7coSytuHupZVpxJstELKPugRmp2R6YeFbKw8P4P/12233Q0FcdKTF6ZE2h3cBp
+ bfCtyyzlFeY/nMfJKkwh37x2fHxNHynCCJEjHhecLday7NKQoTmmafivTfVmolWK
+ /MGjDarTAzC1FaP1xpBnuiI8eCr5WIgb4WmtvOmxIntVT077xggdJLL/Co7fBCqn
+ iibz3U/VyC68kQTGw6ELhnW1d7doHp7H3DJ2gPsh6lzpbv8JAnOMPpD+3Me1DVHE
+ Ay0kxPHV4bqnDyB+uEGppUiNoaTd5InrMAw+udDad60TMwOZzIvMkgxo0PIVM9Mm
+ k6mCcE2+TSnJetueX3cfrS5bRTPxUX7KRDC/WSp67/QPmelbYeRDLR7hrrQVqOPq
+ 5hIKMfz/kTBXcaXk643TEveaZlMuOZxHBYAvsbu5BX/3SQqYFS4POdxdeZVnUf54
+ ITHhftBtrXacCsjKujp0xmKCIpF+8v3yKRxGEQssByv8v+CaymNrEls2vTF8tn5P
+ sAIjPFJYG+IHtDMIsTHOvSPA7uwWYsOVHFEYsbC1758esiBD8+qtfvFS3jAH99z+
+ v2/aGsfMnjYEIsRtSm7PVTybJAo22Gr62yE/Q+rP//O0JaDahgdm009MjUo6BSgg
+ XNhZjQRYAYEExMTjJ2TK
+ =q39P
+ -----END PGP SIGNATURE-----
+
+Merge tags 'devicetree-for-linus' and 'spi-for-linus' of git://git.secretlab.ca/git/linux-2.6
+
+2nd set of device tree changes and SPI bug fixes for v3.3
+
+* tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux-2.6:
+ of/irq: Add interrupts-names property to name an irq resource
+ of/address: Add reg-names property to name an iomem resource
+
+* tag 'spi-for-linus' of git://git.secretlab.ca/git/linux-2.6:
+ spi/tegra: depend instead of select TEGRA_SYSTEM_DMA
+------------------------------------------------
+
+More precisely, a commit object begins with a header of one or more lines
+delimited by ASCII LF. The end of the header is signalled by an empty
+line. Any remaining text after the empty line forms the commit
+message. The header must not contain NUL.
+
+A "continuation line" in the header begins with an SP. The remainder
+of the line, after removing that SP, is concatenated to the previous
+line, while retaining the LF at the end of the previous line.
+
+When a line in the header begins with a letter other than SP, and has
+at least one SP in it, it is called a "field". A field consists of the
+"field name", which is the string before the first SP on the line, and
+its "value", which is everything after that SP. When the value
+consists of multiple lines, continuation lines are used.
+
+More than one field with the same name can appear in the header of an
+object, and the order in which they appear is significant. A commit
+object can contain these fields in the listed order:
+
+1. one "tree" field with the 40-character textual object name of the
+ associated tree object
+2. zero or more "parent" fields, each with 40-character textual object
+ name of the parent commit object
+3. one "author" field with an ident string
+4. one "committer" field with an ident string
+5. zero or one "encoding" field with an ascii string
+6. zero or more "mergetag" fields with associated tag object content
+7. zero or one "gpgsig" field with gpg signature content
+
+New kinds of fields may be added in later versions of git.
+
+Ident strings
+~~~~~~~~~~~~~
+Ident strings record who is responsible for doing something and at
+what time. For a commit, the ident string in the "author" line records
+who authored the associated changes and when the changes were
+made. The ident string in the "committer" line records who committed
+the changes to the repository and at what time.
+
+An ident string consists of an email address and a timestamp. More
+precisely:
+
+1. Optionally, a name
+2. An email address wrapped around by `<` and `>`, followed by one
+ space (ASCII SP)
+3. The number of seconds since Epoch (00:00:00 UTC, January 1, 1970)
+ followed by a space (ASCII SP)
+4. Timezone: either plus or minus sign, followed by 4 decimal digits
+
+Name and email are encoded in UTF-8 and must not contain ASCII
+NUL characters.
+
+Commit encoding
+~~~~~~~~~~~~~~~
+The encoding field names the encoding that the commit message is
+encoded in. Encoding names must be recognized by iconv(3). By default,
+the commit message is in UTF-8. It's discouraged to use encodings that
+can generate ASCII NUL characters.
+
+TAG OBJECTS
+-----------
+Tag object payload contains an object, object type, tag name, the name
+of the person ("tagger") who created the tag, and a message, which may
+contain a signature.
+
+------------------------------------------------
+$ git cat-file tag v1.5.0
+object 437b1b20df4b356c9342dac8d38849f24ef44f27
+type commit
+tag v1.5.0
+tagger Junio C Hamano <junkio@cox.net> 1171411200 +0000
+
+GIT 1.5.0
+-----BEGIN PGP SIGNATURE-----
+Version: GnuPG v1.4.6 (GNU/Linux)
+
+iD8DBQBF0lGqwMbZpPMRm5oRAuRiAJ9ohBLd7s2kqjkKlq1qqC57SbnmzQCdG4ui
+nLE/L9aUXdWeTFPron96DLA=
+=2E+0
+-----END PGP SIGNATURE-----
+------------------------------------------------
+
+Tag object format resembles commit format. A tag object may have the
+following fields in listed order:
+
+1. one "object" field with 40-character textual object name of the
+ tagged object
+2. one "type" field with type of the tagged object ("commit", "tag",
+ "blob", or "tree")
+3. one "tag" field with the name of the tag
+4. one "tagger" field with an ident string
+
+New kinds of fields may be added in later versions of git.
+
+Any remaining text after the header forms the tag message. The tag
+message has no specified encoding. Anything that does not contain ASCII
+NUL characters is accepted.
+
+The object field must point to a valid object of type indicated by the
+type field. The tag name can be an arbitrary string without NUL bytes
+or embedded newlines; in practice it usually follows the restrictions
+described in linkgit:git-check-ref-format[1].
+
+GIT
+---
+Part of the linkgit:git[1] suite
--
1.7.8.36.g69ee2
* Re: [PATCH/RFC v2] Document format of basic Git objects
2012-02-19 4:15 ` [PATCH/RFC v2] " Nguyễn Thái Ngọc Duy
@ 2012-02-19 8:39 ` Junio C Hamano
2012-02-19 9:14 ` Junio C Hamano
2012-02-20 13:55 ` Nguyen Thai Ngoc Duy
2012-02-19 18:07 ` Manually decoding a git object Philip Oakley
1 sibling, 2 replies; 17+ messages in thread
From: Junio C Hamano @ 2012-02-19 8:39 UTC (permalink / raw)
To: Nguyễn Thái Ngọc Duy
Cc: git, Jonathan Niedier, Shawn O. Pearce, Scott Chacon
Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes:
> Still draft for discussion. Of three people who participated on this
> thread, two favor a man page (me and Jonathan), one technical/
> (Junio), so let's put it as a man page for now.
Personally I do not have strong preference either way.
The original motivation of technical/ was that we wanted to have a place
to keep documentation that would help ourselves, the people who write the
internals of git, even though we did not yet know and did not want to have
to decide if it is a good idea to expose the end users, who may not care
about the gory details of the internals, to reams of such documents.
> - Not sure if we fix the order of gpgsig and mergetag, or they can be
> mixed together. Also not sure if we can have multiple gpgsig.
You can merge a signed tag and then sign the resulting commit yourself,
and the order of the mixing would not matter. Technically a gpgsig is a
signature over the commit object payload without gpgsig lines, so you
could have two or more gpgsigs on the same commit object but from a larger
workflow point of view it would not be so useful, as it would involve
steps like this:
* You prepare a commit object, you may perhaps sign it yourself;
* You expose this commit object to chosen others from whom you want their
signature on it;
* They sign it with "commit -S --amend", but when they do so they make
sure the resulting commit has the same committer/author header as the
original. Note that the resulting commits will all have different
object name, as the object name is over all payload including their
gpgsigs.
* You grab the gpgsig lines from these commits, paste them into the
header part of the original, and then re-hash the result with
"hash-object -w -t commit". The result will have all valid gpgsigs
over the payload in the commit without its gpgsig lines, because the
gpgsig lines from all the signers were generated that way.
* Then you give the general public the resulting commit.
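To make "the payload in the commit without its gpgsig lines" concrete, here is a minimal Python 3 sketch. It assumes the "gpgsig" field is stored like the multi-line "mergetag" field, that is, with continuation lines that start with a space; `strip_gpgsig` is a hypothetical helper name, not something that exists in git.

```python
def strip_gpgsig(payload: bytes) -> bytes:
    """Return the commit payload with every "gpgsig" header field removed.

    Assumption: a gpgsig field is one "gpgsig ..." line followed by zero or
    more continuation lines, each beginning with a single space.
    """
    # The first empty line separates the header from the log message.
    header, sep, body = payload.partition(b"\n\n")
    kept = []
    skipping = False
    for line in header.split(b"\n"):
        if line.startswith(b" "):
            # Continuation line: it shares the fate of the field it belongs to.
            if not skipping:
                kept.append(line)
            continue
        skipping = line.startswith(b"gpgsig ")
        if not skipping:
            kept.append(line)
    return b"\n".join(kept) + sep + body
```

Every signer's signature would then be checked against the same stripped payload, which is why several gpgsig fields could coexist in one object.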
> - I skipped the experimental loose object format (it's what it's
> called in sha1_file.c). I think we can call it deprecated and move
> on.
Good.
> - Do we assume tag/commit header in utf-8 or ascii?
Author-ident is typically utf-8 already, so you cannot assume "ASCII".
> +Object SHA-1
> +~~~~~~~~~~~~
> +An object SHA-1 is calculated on its header and payload. The content
> +to be consumed by SHA-1 calculation is:
> +
> +- Object type in ascii, either "commit", "tree", "tag" or "blob"
> + (without quotes)
> +- One space (ASCII SP)
> +- The payload length in ascii canonical decimal format
"canonical" may make it sound as if the document is more formal, but then
you would have to define what is canonical and what is not somewhere else,
so I would suggest dropping it.
The length of the payload in bytes, represented as a decimal integer.
Also if you spell ASCII, consistently spell it in all-caps.
> +- ASCII NUL
> +- Object payload
----------------------------------------------------------------
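The list quoted above (type, SP, decimal length, NUL, payload) can be sketched in a few lines of Python 3. The function name `git_object_id` is made up for illustration; only `hashlib` from the standard library is used.

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Hash "<type> SP <length> NUL <payload>" and return the hex SHA-1."""
    if obj_type not in ("commit", "tree", "tag", "blob"):
        raise ValueError("unknown object type: %r" % obj_type)
    # The length is the payload size in bytes, written as a decimal integer.
    header = obj_type.encode("ascii") + b" " + str(len(payload)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


# The well-known object name of the empty blob.
print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The result should agree with what "git hash-object -t <type> --stdin" prints for the same bytes.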
> +BLOB OBJECTS
> +------------
> +Blob object payload is file data.
What's the significance of saying "file data" here? In a document that
describes the structure, saying "is an uninterpreted sequence of bytes" is
more accurate (the important point is that git does not care what it is)
and covers cases where a blob was recorded with "hash-object -w --stdin"
where no such "file data" has ever existed in a "file". Also a blob may
record contents of a symbolic link ;-).
> +TREE OBJECTS
> +------------
> +Tree object payload contains a list of entries, each with a mode,
> +object type, object name, and filename, sorted by filename. It
> +represents the contents of a single directory tree.
Drop "object type," from this list. It is inferred from the mode. I
personally would prefer to say "path" or "pathname" when the entity
referred to may not be a regular file. I am not sure the last sentence is
necessary, but if you must say something, say "It represents a
directory". It is by definition redundant to say that a tree represents a
"tree". Replace the above with something like this:
... entries, each with a mode, object name and path. The type of
the object is encoded in the "mode":
- 100644 or 100755: the object is a "blob" that records the
contents of a regular non-executable or executable file,
respectively, that exists at the path.
- 120000: the object is a "blob" that records the contents of a
symbolic link that exists at the path.
- 40000: the object is a "tree" that represents a subdirectory
that exists at the path.
- 160000: the object is a "commit" that records the state of a
submodule that exists at the path.
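As an illustration of "the type of the object is encoded in the mode", the five modes listed above can be written down as a lookup table. This is only a sketch in Python 3; `MODE_TO_TYPE` and `entry_type` are names invented here, and the historical synonyms mentioned elsewhere in the draft (such as 100664) are deliberately left out.

```python
# Mode (octal ASCII, no leading zero) -> (object type, what the entry records)
MODE_TO_TYPE = {
    "100644": ("blob", "regular non-executable file"),
    "100755": ("blob", "regular executable file"),
    "120000": ("blob", "symbolic link"),
    "40000": ("tree", "subdirectory"),
    "160000": ("commit", "submodule"),
}


def entry_type(mode: str) -> str:
    """Return the object type implied by a tree entry mode."""
    try:
        return MODE_TO_TYPE[mode][0]
    except KeyError:
        raise ValueError("mode not in the documented list: %r" % mode)
```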
> +The object type may be a blob, representing the contents of a file,
> +another tree, representing the contents of a subdirectory, or a commit
> +(representing a subproject).
and drop the above line.
> +Since trees and blobs, like all other
> +objects, are named by a hash of their contents, two trees have the
> +same object name if and only if their contents (including,
> +recursively, the contents of all subdirectories) are identical. This
> +allows git to quickly determine the differences between two related
> +tree objects, since it can ignore any entries with identical object
> +names.
It does not make sense to say 'trees and blobs' when you explain that a
single top-level tree object defines the entire tree's state. Just say
'trees'. I know you would say "I wanted to say if tree A and tree B are
the same except for the content of a single blob recorded at path P, the
result of hash for A and B would be different", but the same can be said
for a submodule, so singling out 'blob' is incomplete. Also these trees
may record the same set of blobs but tree B may record what tree A had at
path P at path Q, so it is not like the only thing that matters in the tree
is the object names.
I personally do not think it is necessary to have the above paragraph at
all in this object.
> +Note that the files all have mode 644 or 755: git actually only pays
> +attention to the executable bit.
Saying 644 or 755 here is misleading as it does not match any reality
(except for very early incarnation of git). By rewriting the first
paragraph, these two lines can be safely eliminated.
> +Encoding
> +~~~~~~~~
"Encoding" is such a loaded word and does not help clarify what this
section is really about, which is "format of a tree entry", or simply
"Entries".
> +Entries are of variable length and self-delimiting. Each entry
> +consists of
> +
> +- a POSIX file mode in octal ascii representation, no 0 padding to the
> + left
This is not "a POSIX file mode" at all. The mode in a tree entry was
modelled after that, but there is no need to mention it, especially
because POSIX does not define the exact bit assignment for types (the
permission are defined from S_IXOTH to S_IRWXU and S_ISUID/S_ISGID with
exact bit locations) and because of S_IFGITLINK which is clearly not
POSIX. As we have enumerated them in the first paragraph,
The "mode" (see above).
is sufficient here.
> +- exactly one space (ASCII SP)
> +- filename for the entry, as a NUL-terminated string
Again, "pathname" or just "path" for this entire document.
> +- 20-byte binary object name
> +
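Because each entry is self-delimiting (mode, SP, name, NUL, 20 raw bytes), a tree payload can be walked without any outer framing. The following Python 3 sketch shows one way to do that; `parse_tree` is a hypothetical name and the function performs no validation beyond detecting truncation.

```python
def parse_tree(payload: bytes):
    """Split a raw tree payload into (mode, name, hex object name) tuples."""
    entries = []
    pos = 0
    while pos < len(payload):
        sp = payload.index(b" ", pos)          # end of the octal ASCII mode
        nul = payload.index(b"\x00", sp + 1)   # end of the path component
        mode = payload[pos:sp].decode("ascii")
        name = payload[sp + 1:nul]             # kept as bytes: no encoding is implied
        sha = payload[nul + 1:nul + 21]        # 20-byte binary object name
        if len(sha) != 20:
            raise ValueError("truncated object name in tree entry")
        entries.append((mode, name, sha.hex()))
        pos = nul + 21
    return entries
```

The payload to feed it is what remains of a tree object after the "tree <length> NUL" prefix has been removed.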
> +The mode should be 100755 (executable file), 100644 (regular file),
> +120000 (symlink), 40000 (subdirectory), or 160000 (subproject), with
> +no leading zeroes. Modes with one leading zero and the synonym 100664
> +for 100644 are also accepted for historical reasons. Other modes are
> +not accepted.
This is made redundant by the first paragraph above.
> +The filename may be an arbitrary nonempty string of bytes, as long as
> +it contains no '/' or NUL character.
s/, as long as it contains no/; it cannot contain any/
> +The associated object must be a valid blob if the mode indicates a
> +file or symlink, tree if it indicates a subdirectory, or commit if it
> +indicates a subproject. The blob associated to a symlink entry
> +indicates the link target and its content not have any embedded NULs.
I doubt that we should even mention "and its content not have ...". It is
for readlink(2) and symlink(2) to decide.
> +Sorting
> +~~~~~~~
> +Entries are sorted by memcmp(3) on file name. No duplicate file names
> +allowed.
A sentence without a verb seen at the end of this paragraph.
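The rule as the draft states it (memcmp order on names, no duplicates) could be checked as in the Python 3 sketch below. Python compares `bytes` values byte by byte as unsigned values, with a shorter prefix sorting first, which is used here as a stand-in for memcmp(3). Note the assumption: this follows the draft wording only. As far as I know, trees actually written by git order a subdirectory as if its name ended in "/", which this sketch does not model. `check_tree_order` is a made-up name.

```python
def check_tree_order(names):
    """Raise ValueError unless names (a list of bytes) are strictly increasing."""
    for prev, cur in zip(names, names[1:]):
        if prev == cur:
            raise ValueError("duplicate entry name: %r" % cur)
        if prev > cur:
            raise ValueError("entries out of order: %r before %r" % (prev, cur))
```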
> +COMMIT OBJECT
> +-------------
> +The commit object links a physical state of a tree with a description
> +of how we got there and why.
What is the intended audience and the purpose of this document? If this
were to strictly define and describe the "structure", then "and why" is
inappropriate. It is merely the best-current-practice at the human level
to describe the "why" in their commit log messages---it does not break the
structure if nobody explains "why".
On the other hand, "how we got there" is a good phrase to explain that by
referring to its immediate parents, all the previous histories are also
described.
> +... Commit object payload contains the
> +associated tree SHA-1, parent commits's SHA-1, author and comitter
> +information.
s/.$/, among other things./; as the log message is also part of the
payload.
Start by labeling the large example block you are about to throw
at the reader here.
> +------------------------------------------------
> +$ git cat-file commit 81d48f0aee54
> +tree 093f37084c133795e4ce71befa57185328737171
> +parent f5e4e20faa1eee3feaa0394897bbd1aca544e809
> +parent 661db794eb8179c7bea02f159bb691a2fff4a8e0
> +parent 14c173eb63432ba5d0783b6c4b23a8fe0c76fb0f
> +author Linus Torvalds <torvalds@linux-foundation.org> 1326576355 -0800
> +committer Linus Torvalds <torvalds@linux-foundation.org> 1326576355 -0800
> +mergetag object 661db794eb8179c7bea02f159bb691a2fff4a8e0
> + type commit
> + tag devicetree-for-linus
> + tagger Grant Likely <grant.likely@secretlab.ca> 1326520038 -0700
> +
> + 2nd set of device tree changes for v3.3
> + -----BEGIN PGP SIGNATURE-----
> + Version: GnuPG v1.4.11 (GNU/Linux)
> +
> + iQIcBAABAgAGBQJPERbzAAoJEEFnBt12D9kBmDIP/R9Vspc6yhjSAEvdp/VET2gi
> + TgAQfdp4VuYjjIt4cUPO5UQU9kw478GjTuP2blZEC9DlG1jSf/L8U+A7FHJIVVzU
Elide the above like so:
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPERbzAAoJEEFnBt12D9kBmDIP/R9Vspc6yhjSAEvdp/VET2gi
TgAQfdp4VuYjjIt4cUPO5UQU9kw478GjTuP2blZEC9DlG1jSf/L8U+A7FHJIVVzU
...
=mup8
-----END PGP SIGNATURE-----
> +Merge tags 'devicetree-for-linus' and 'spi-for-linus' of git://git.secretlab.ca/git/linux-2.6
> +
> +2nd set of device tree changes and SPI bug fixes for v3.3
> +
> +* tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux-2.6:
> + of/irq: Add interrupts-names property to name an irq resource
> + of/address: Add reg-names property to name an iomem resource
> +
> +* tag 'spi-for-linus' of git://git.secretlab.ca/git/linux-2.6:
> + spi/tegra: depend instead of select TEGRA_SYSTEM_DMA
> +------------------------------------------------
> +
> +More precisely, a commit object begins with of one or more lines
> +delimited by ASCII LF. The end of the header is signalled by an empty
> +line. Any remaining text after the empty line forms the commit
Drop "More precisely, ". Also notice that you abruptly said "end of the
header" without mentioning anything about "header" in the previous
sentence.
A commit object begins with the "header" part, that consists of
one or more lines delimited by LF, and the "body" part, that
records the commit log message. The first empty line delimits the
header and the body.
> +The header must not contain NUL.
I vaguely recall that you made sure neither the header nor the body
contains NUL.
> +A "continuation line" in the header begins with an SP. The remainder
> +of the line, after removing that SP, is concatenated to the previous
> +line, while retaining the LF at the end of the previous line.
> +
> +When a line in the header begins with a letter other than SP, and has
> +at least one SP in it, it is called a "field". A field consists of the
> +"field name", which is the string before the first SP on the line, and
> +its "value", which is everything after that SP. When the value
> +consists of multiple lines, continuation lines are used.
> +
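The header rules quoted so far (first empty line ends the header, a line starting with SP continues the previous field, the LF is retained) translate fairly directly into code. A minimal Python 3 sketch, with the hypothetical name `parse_commit`, might look like this; it keeps fields as an ordered list because the same name may appear more than once.

```python
def parse_commit(payload: bytes):
    """Return (fields, body); fields is an ordered list of (name, value) pairs."""
    # The first empty line delimits the header and the body.
    header, _, body = payload.partition(b"\n\n")
    fields = []
    for line in header.split(b"\n"):
        if not line:
            continue  # trailing LF of an object that has no body
        if line.startswith(b" "):
            if not fields:
                raise ValueError("continuation line before any field")
            name, value = fields[-1]
            # Drop the leading SP and keep the LF of the previous line.
            fields[-1] = (name, value + b"\n" + line[1:])
        else:
            name, _, value = line.partition(b" ")
            fields.append((name, value))
    return fields, body
```

An empty line inside a "mergetag" value is stored as a line holding a single SP, so it does not end the header early.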
> +More than one field with the same name can appear in the header of an
> +object, and the order in which they appear is significant. A commit
> +object can contain these fields in the listed order:
s/can contain/contains/; as you are marking optional ones with "zero or".
> +1. one "tree" field with the 40-character textual object name of the
> + associated tree object
> +2. zero or more "parent" fields, each with 40-character textual object
> + name of the parent commit object
> +3. one "author" field with an ident string
> +4. one "committer" field with an ident string
> +5. zero or one "encoding" field with an ascii string
s/zero or one/optionally, one/ (not a strong preference--I just felt that
would be easier to read).
After the above fields, other fields may follow, and new types of
fields may be added in later versions of git. Example of these
optional fields are:
- "mergetag" that copies the contents of a signed tag on one of
the parent commit;
- "gpgsig" that records a GPG signature for this commit object.
> +6. zero or more "mergetag" fields with associated tag object content
> +7. zero or one "gpgsig" field with gpg signature content
and exclude these two from the numbering above to make it clear they are
optional.
> +Ident strings
> +~~~~~~~~~~~~~
> +Ident strings record who's responsible of doing something at what
> +time. For a commit, the ident string in "author" line records who is
> +the author of the associated changes and when the changes are
s/are/were/, perhaps? Again, what is the purpose of this document? If this
were more than to strictly describe the "structure", it is OK and even
preferable to leave the meaning of "author" vague, but if this were
also to suggest the best current practice interpretation, it may be worth
adding something like
There may be a case where it is difficult to attribute a commit to
a single author; think of it as recording the primary contact, the
person to ask any questions about the commit if needed later.
> +made. The ident string in "committer" line records who commits the
s/commits/committed/, perhaps?
> +changes to the repository and at what time.
> +
> +An ident string consists of an email address and a timestamp. More
> +precisely:
s/of an email/of a name, an email/;
s/. More precisely:/:/;
> +1. Optionally, a name
> +2. An email address wrapped around by `<` and `>`, followed by one
> + space (ASCII SP)
The above makes it sound as if "A U Thor<author@example.xz>" is usual and
valid. How about
1. A name, followed by one ASCII SP
and after this enumeration, say something like:
Name may be missing in commit objects produced by repository
conversion from other SCMs that do not have it. Name and email
are typically encoded in UTF-8.
even though I am not sure the last sentence should be in this document.
> +3. The number of seconds since Epoch (00:00:00 UTC, January 1, 1970)
> + followed by a space (ASCII SP)
> +4. Timezone: either plus or minus sign, followed by 4 decimal digits
> +
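For illustration, the four parts of an ident string (optional name, email in angle brackets, seconds since the Epoch, timezone) can be picked apart with a regular expression. This Python 3 sketch is not how git itself parses idents; `Ident` and `parse_ident` are invented names, and the timezone is converted to minutes east of UTC purely for convenience.

```python
import re
from typing import NamedTuple, Optional


class Ident(NamedTuple):
    name: Optional[str]      # None when the name is missing
    email: str
    timestamp: int           # seconds since 00:00:00 UTC, January 1, 1970
    tz_minutes: int          # offset from UTC in minutes, e.g. -0800 -> -480


_IDENT_RE = re.compile(
    r"(?:(?P<name>.*?) )?<(?P<email>[^<>]*)> (?P<time>\d+) (?P<tz>[+-]\d{4})"
)


def parse_ident(value: str) -> Ident:
    m = _IDENT_RE.fullmatch(value)
    if m is None:
        raise ValueError("not an ident string: %r" % value)
    tz = m.group("tz")
    sign = -1 if tz[0] == "-" else 1
    offset = sign * (int(tz[1:3]) * 60 + int(tz[3:5]))
    return Ident(m.group("name"), m.group("email"), int(m.group("time")), offset)
```

The input is the field value only, that is, the text after "author " or "committer ".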
> +Name and email are encoded in UTF-8 and must must not contain ASCII
> +NUL characters.
Drop " and must must ...characters"; you already said that the header does
not have any NUL. As I already said, I am not sure if you should mention
"UTF-8" at all in this document.
> +Commit encoding
> +~~~~~~~~~~~~~~~
> +Encoding field describes that encoding that the commit message is
> +encoded in.
s/that encoding that/the character encoding in which/;
s/encoded in/recorded/;
> +... Encoding names must be recognized by iconv(3). By default,
> +commit message is in UTF-8. It's discouraged to use encodings that can
> +generate ASCII NUL characters.
Here we would probably want to have a paragraph each for "mergetag" and
"gpgsig".
> +TAG OBJECTS
> +-----------
> +Tag object payload contains an object, object type, tag name, the name
> +of the person ("tagger") who created the tag, and a message, which may
> +contain a signature.
s/a signature/a signature at the end/;
> +------------------------------------------------
> +$ git cat-file tag v1.5.0
> +object 437b1b20df4b356c9342dac8d38849f24ef44f27
> +type commit
> +tag v1.5.0
> +tagger Junio C Hamano <junkio@cox.net> 1171411200 +0000
> +
> +GIT 1.5.0
> +-----BEGIN PGP SIGNATURE-----
> +Version: GnuPG v1.4.6 (GNU/Linux)
> +
> +iD8DBQBF0lGqwMbZpPMRm5oRAuRiAJ9ohBLd7s2kqjkKlq1qqC57SbnmzQCdG4ui
> +nLE/L9aUXdWeTFPron96DLA=
> +=2E+0
> +-----END PGP SIGNATURE-----
> +------------------------------------------------
> +
> +Tag object format resembles commit format. A tag commit may have the
> +following fields in listed order:
> +
> +1. one "object" field with 40-character textual object name of the
> + tagged object
> +2. one "type" field with type of the tagged object ("commit", "tag",
> + "blob", or "tree")
> +3. one "tag" field with the name of the tag
> +4. one "tagger" with an ident string
> +
> +New kinds of fields may be added in later versions of git.
> +
> +Any remaining text after the header forms the tag message. Tag message
> +has no specified encoding. Anything that does not contain ASCII NUL
> +characters are accepted.
> +
> +The object field must point to a valid object of type indicated by the
> +type field. The tag name can be an arbitrary string without NUL bytes
> +or embedded newlines; in practice it usually follows the restrictions
> +described in linkgit:git-check-ref-format[1].
A description of how the signature part is formed needs to come here.
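Until such a description exists, the v1.5.0 example quoted above suggests the layout: four header fields, an empty line, the message, and an ASCII-armored PGP block at the very end. The Python 3 sketch below splits a tag payload on that assumption; `split_tag` is a hypothetical helper, and treating everything from the "BEGIN PGP SIGNATURE" line onward as the signature is a guess based on that one example rather than a documented rule.

```python
PGP_BEGIN = b"-----BEGIN PGP SIGNATURE-----"


def split_tag(payload: bytes):
    """Return (fields, message, signature) for a raw tag payload."""
    header, _, rest = payload.partition(b"\n\n")
    fields = {}
    for line in header.split(b"\n"):
        name, _, value = line.partition(b" ")
        fields[name] = value
    # Assumption: the signature starts at the first line that begins the armor.
    if rest.startswith(PGP_BEGIN):
        return fields, b"", rest
    pos = rest.find(b"\n" + PGP_BEGIN)
    if pos < 0:
        return fields, rest, b""
    return fields, rest[:pos + 1], rest[pos + 1:]
```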
> +GIT
> +---
> +Part of the linkgit:git[1] suite
* Re: [PATCH/RFC v2] Document format of basic Git objects
2012-02-19 8:39 ` Junio C Hamano
@ 2012-02-19 9:14 ` Junio C Hamano
2012-02-20 13:55 ` Nguyen Thai Ngoc Duy
1 sibling, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2012-02-19 9:14 UTC (permalink / raw)
To: Nguyễn Thái Ngọc Duy
Cc: git, Jonathan Niedier, Shawn O. Pearce, Scott Chacon
Junio C Hamano <gitster@pobox.com> writes:
> Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes:
>
>> Still draft for discussion.
Small clarifications and corrections.
> ... from a larger
> workflow point of view it would not be so useful, as it would involve
> steps like this:
>
> ...
> * Then you give the general public the resulting commit.
The point of the above is *not* that it involves hash-object or having to
preserve both author and committer dates when secondary signers sign the
commit---these are something the tool *could* learn to assist. The point
is that adding more signatures *must* change the resulting commit object
name, making it necessary not to expose it to the general public in order
to avoid history rewinding, which *is* what makes it "not so useful".
And that is why I didn't add such a tool support to help producing the end
result that wouldn't be so useful anyway.
>> +TREE OBJECTS
>> +------------
>> +Tree object payload contains a list of entries, each with a mode,
>> +object type, object name, and filename, sorted by filename. It
>> +represents the contents of a single directory tree.
>
> Drop "object type," from this list. It is inferred from the mode. I
> personally would prefer to say "path" or "pathname" when the entity
> referred to may not be a regular file.
The principle is not to say "filename" to give an incorrect impression
that we are only talking about a regular file. This principle applies to
pathnames in general (i.e. covers what is recorded in the index, too), but
because we are talking about an entry in a tree, "pathname component" is
even better than "path" or "pathname", because it has a specific meaning:
one part of pathname delimited by a slash.
POSIX does use "filename" for this purpose (and mentions "pathname
component" as a synonym), but if we use the word, without clarifying that
this document uses it in the strict POSIX sense, the reader can easily
misunderstand that we mean a more general "name of a regular file".
>> +The object type may be a blob, representing the contents of a file,
>> +another tree, representing the contents of a subdirectory, or a commit
>> +(representing a subproject).
>
> and drop the above line.
Should be obvious from the context, but I meant "drop the above three
lines".
> I personally do not think it is necessary to have the above paragraph at
> all in this object.
s/in this object/in this document/;
>> +Encoding
>> +~~~~~~~~
>
> "Encoding" is such a loaded word and does not help clarify what this
> section is really about, which is "format of a tree entry", or simply
> "Entries".
>
>> +Entries are of variable length and self-delimiting. Each entry
>> +consists of
Actually, title this section as "Tree Entries", and begin the paragraph
with
Tree entries are of ...delimiting. Each entry consists of...
>> +Ident strings
>> +~~~~~~~~~~~~~
>> +Ident strings record who's responsible of doing something at what
>> +time. For a commit, the ident string in "author" line records who is
>> +the author of the associated changes and when the changes are
>
> s/are/were/, perhaps? Again, what the purpose of this document? If this
> were more than to strictly describe the "structure", it is OK and even
s/ more than to/to/;
> preferable to leave the meaning the "author" as vague, but if this were
> also to suggest the best current practice interpretation, it may be worth
> to add something like
>
> There may be a case where it is difficult to attribute a commit to
> a single author; think of it as recording the primary contact, the
> person to ask any questions about the commit if needed later.
* Manually decoding a git object
2012-02-19 4:15 ` [PATCH/RFC v2] " Nguyễn Thái Ngọc Duy
2012-02-19 8:39 ` Junio C Hamano
@ 2012-02-19 18:07 ` Philip Oakley
2012-02-20 4:45 ` 徐迪
2012-02-20 8:29 ` Thomas Rast
1 sibling, 2 replies; 17+ messages in thread
From: Philip Oakley @ 2012-02-19 18:07 UTC (permalink / raw)
To: Git List
If I have a renamed file which is a git object, such as "Git_Object" (formerly
8c-something-or-other), what is the easiest way of examining / decoding /
recreating the original file (either as its sha1, or a cat-file)?
I don't appear to be able to unzip the file in its raw format... I'm using
Msysgit on windows XP.
Philip
* Re: Manually decoding a git object
2012-02-19 18:07 ` Manually decoding a git object Philip Oakley
@ 2012-02-20 4:45 ` 徐迪
2012-02-20 8:19 ` Philip Oakley
2012-02-20 8:29 ` Thomas Rast
1 sibling, 1 reply; 17+ messages in thread
From: 徐迪 @ 2012-02-20 4:45 UTC (permalink / raw)
To: Philip Oakley; +Cc: Git List
2012/2/20 Philip Oakley <philipoakley@iee.org>:
> If I have a renamed file which is a git object, such a "Git_Object", was
> 8c-something-or-other, what is the easiest way of examining / decoding /
> recreating the original file (either as its sha1, or a cat-file).
>
I don't think I fully understood what you mean. I assume you just moved
an object file from $GIT_DIR/objects/ to somewhere else and renamed it;
let's call it "obj". If you want to examine its content you can
simply call "git cat-file -p obj", and you can also use "git cat-file
-t obj" to examine its object type. If it's a blob you can use "git
cat-file -p obj > original" to recreate it; otherwise it's meaningless to
recreate it.
* Re: Manually decoding a git object
2012-02-20 4:45 ` 徐迪
@ 2012-02-20 8:19 ` Philip Oakley
0 siblings, 0 replies; 17+ messages in thread
From: Philip Oakley @ 2012-02-20 8:19 UTC (permalink / raw)
To: 徐迪; +Cc: Git List
From: "徐迪" <xudifsd@gmail.com> Sent: Monday, February 20, 2012 4:45 AM
> 2012/2/20 Philip Oakley <philipoakley@iee.org>:
>> If I have a renamed file which is a git object, such a "Git_Object", was
>> 8c-something-or-other, what is the easiest way of examining / decoding /
>> recreating the original file (either as its sha1, or a cat-file).
>>
> I don't think I fully understood what you mean, I assume you just move
> an object file from $GIT_DIR/objects/ to somewhere and rename it,
> let's call it "obj", so if you want to exam its content you can just
> simply call "git cat-file -p obj". And you can also use "git cat-file
> -t obj" to exam its object type. If it's a blob you can use "git
> cat-file -p obj > original" to recreate it, else it's meaningless to
> recreate it.
When I tried it from my home directory (not in a git directory):
$ git cat-file -p Git-Object
fatal: Not a git repository (or any of the parent directories): .git
Because its sha1 isn't yet known I can't put it into the correct
.git/objects/xx/ subdirectory of a fresh 'git init', and I have located an
unzip programme that will take the plain git object and decode it - they all
expect archives.
I've described the background use-case at
http://stackoverflow.com/questions/9341278/how-to-track-the-git-directory-in-git-in-its-own-store -
the edit links to a typical corporate scenario.
Even just locating a zlib implementation that simply confirms the file
stream is compressed and inflates it would be a start.
Philip
* Re: Manually decoding a git object
2012-02-19 18:07 ` Manually decoding a git object Philip Oakley
2012-02-20 4:45 ` 徐迪
@ 2012-02-20 8:29 ` Thomas Rast
2012-02-20 10:19 ` Philip Oakley
1 sibling, 1 reply; 17+ messages in thread
From: Thomas Rast @ 2012-02-20 8:29 UTC (permalink / raw)
To: Philip Oakley; +Cc: Git List
"Philip Oakley" <philipoakley@iee.org> writes:
> If I have a renamed file which is a git object, such a "Git_Object", was
> 8c-something-or-other, what is the easiest way of examining / decoding /
> recreating the original file (either as its sha1, or a cat-file).
>
> I don't appear to be able to unzip the file in its raw format... I'm using
> Msysgit on windows XP.
The SHA1 is over the decompressed object contents. The file simply
holds a zlib-compressed stream of those contents. (It's pretty much
like gzip without the file header.)
You can use any bindings to zlib and something that does sha1, e.g. in
python:
$ cd g/.git/objects/aa/ # my git.git
$ ls
592bda986a8380b64acd8cbb3d5bdfcbc0834d 6322a757bee31919f54edcc127608a3d724c99
$ python
Python 2.7.2 (default, Aug 19 2011, 20:41:43) [GCC] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> hashlib.sha1(open('592bda986a8380b64acd8cbb3d5bdfcbc0834d').read().decode('zlib')).digest().encode('hex')
'aa592bda986a8380b64acd8cbb3d5bdfcbc0834d'
Notice that the first byte of the hash goes into the directory name.
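The one-liner above relies on Python 2 string codecs ('zlib' and 'hex'), which Python 3 no longer offers on str/bytes in that form. A rough Python 3 equivalent, using only the standard `zlib` and `hashlib` modules, is sketched below. It also splits off the "<type> SP <length> NUL" prefix so the payload can be inspected. The function names are made up for this example, and the file must be opened in binary mode (this matters on Windows in particular).

```python
import hashlib
import zlib


def decode_loose_object(compressed: bytes):
    """Inflate a loose object and return (sha1 hex, type, size, payload)."""
    raw = zlib.decompress(compressed)
    # The SHA-1 is taken over the decompressed contents, prefix included.
    sha = hashlib.sha1(raw).hexdigest()
    header, _, payload = raw.partition(b"\x00")
    obj_type, _, size = header.partition(b" ")
    return sha, obj_type.decode("ascii"), int(size), payload


def decode_loose_object_file(path):
    """Same as decode_loose_object, reading the bytes from a file."""
    with open(path, "rb") as f:  # binary mode: the data is not text
        return decode_loose_object(f.read())
```

Calling `decode_loose_object_file("Git_Object")` on a renamed object file would work from any directory, since nothing here needs a repository.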
--
Thomas Rast
trast@{inf,student}.ethz.ch
* Re: Manually decoding a git object
2012-02-20 8:29 ` Thomas Rast
@ 2012-02-20 10:19 ` Philip Oakley
2012-02-20 10:56 ` Thomas Rast
0 siblings, 1 reply; 17+ messages in thread
From: Philip Oakley @ 2012-02-20 10:19 UTC (permalink / raw)
To: Thomas Rast; +Cc: Git List, 徐迪
From: "Thomas Rast" <trast@inf.ethz.ch> Sent: Monday, February 20, 2012 8:29
AM
> "Philip Oakley" <philipoakley@iee.org> writes:
>
>> If I have a renamed file which is a git object, such a "Git_Object", was
>> 8c-something-or-other, what is the easiest way of examining / decoding /
>> recreating the original file (either as its sha1, or a cat-file).
>>
>> I don't appear to be able to unzip the file in its raw format... I'm
>> using
>> Msysgit on windows XP.
Correction to reply to xu's message: and I have /NOT/ located an
unzip programme that will take the plain git object and decode it.
>
> The SHA1 is over the decompressed object contents. The file simply
> holds a zlib-compressed stream of those contents. (It's pretty much
> like gzip without the file header.)
>
> You can use any bindings to zlib and something that does sha1, e.g. in
> python:
>
> $ cd g/.git/objects/aa/ # my git.git
> $ ls
> 592bda986a8380b64acd8cbb3d5bdfcbc0834d
> 6322a757bee31919f54edcc127608a3d724c99
> $ python
> Python 2.7.2 (default, Aug 19 2011, 20:41:43) [GCC] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import hashlib
> >>>
> hashlib.sha1(open('592bda986a8380b64acd8cbb3d5bdfcbc0834d').read().decode('zlib')).digest().encode('hex')
> 'aa592bda986a8380b64acd8cbb3d5bdfcbc0834d'
>
> Notice that the first byte of the hash goes into the directory name.
>
At the moment I'm in a Catch 22 situation where I can't make the first step
of examining the deflated contents, so I can't do all those next steps to
get the sha1 etc.. Have I misunderstood your suggestions?
Philip
* Re: Manually decoding a git object
2012-02-20 10:19 ` Philip Oakley
@ 2012-02-20 10:56 ` Thomas Rast
2012-02-20 11:39 ` 徐迪
2012-02-20 18:27 ` Philip Oakley
0 siblings, 2 replies; 17+ messages in thread
From: Thomas Rast @ 2012-02-20 10:56 UTC (permalink / raw)
To: Philip Oakley; +Cc: Git List, 徐迪
Philip Oakley <philipoakley@iee.org> writes:
> From: "Thomas Rast" <trast@inf.ethz.ch> Sent: Monday, February 20,
> 2012 8:29 AM
>>
>> The SHA1 is over the decompressed object contents. The file simply
>> holds a zlib-compressed stream of those contents. (It's pretty much
>> like gzip without the file header.)
>>
>> You can use any bindings to zlib and something that does sha1, e.g. in
>> python:
>>
>> $ cd g/.git/objects/aa/ # my git.git
>> $ ls
>> 592bda986a8380b64acd8cbb3d5bdfcbc0834d
>> 6322a757bee31919f54edcc127608a3d724c99
>> $ python
>> Python 2.7.2 (default, Aug 19 2011, 20:41:43) [GCC] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> import hashlib
>> >>>
>> hashlib.sha1(open('592bda986a8380b64acd8cbb3d5bdfcbc0834d').read().decode('zlib')).digest().encode('hex')
>> 'aa592bda986a8380b64acd8cbb3d5bdfcbc0834d'
>>
>> Notice that the first byte of the hash goes into the directory name.
>>
>
> At the moment I'm in a Catch 22 situation where I can't make the first
> step of examining the deflated contents, so I can't do all those next
> steps to get the sha1 etc.. Have I misunderstood your suggestions?
Huh? The method I showed does not rely on knowing the SHA1. The fact
that I used it on a properly filed away (by its SHA1) object file is
immaterial, if perhaps confusing.
I can untangle that python expression for you:
hashlib.sha1(foo).digest() gives the SHA1 digest of the string foo, as a (binary) string
foo.encode('hex') turns foo from (binary) string into its hex representation
open('filename').read() opens the file called filename, and returns its whole contents
foo.decode('zlib') applies the zlib decompressor to foo, and returns the resulting data
So that trick works for any file[*], and you can then use its results to
file it back where it needs to go.
[*] that is sufficiently small for Python to hold it in memory, but git
shares the same problems in that department.
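"Filing it back" then only needs the path, which follows from the earlier remark that the first byte of the hash (two hex digits) is the directory name. A small Python 3 sketch, with the invented name `loose_object_path`; it uses `PurePosixPath` only so the result prints with forward slashes everywhere.

```python
import string
from pathlib import PurePosixPath


def loose_object_path(git_dir: str, sha_hex: str) -> PurePosixPath:
    """Return <git_dir>/objects/<first 2 hex digits>/<remaining 38>."""
    if len(sha_hex) != 40 or any(c not in string.hexdigits for c in sha_hex):
        raise ValueError("expected a 40-character hex SHA-1: %r" % sha_hex)
    sha_hex = sha_hex.lower()
    return PurePosixPath(git_dir) / "objects" / sha_hex[:2] / sha_hex[2:]
```

With the object from the session above, `loose_object_path(".git", "aa592bda986a8380b64acd8cbb3d5bdfcbc0834d")` points at the same aa/592bda... file that "ls" listed.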
--
Thomas Rast
trast@{inf,student}.ethz.ch
* Re: Manually decoding a git object
2012-02-20 10:56 ` Thomas Rast
@ 2012-02-20 11:39 ` 徐迪
2012-02-20 18:27 ` Philip Oakley
1 sibling, 0 replies; 17+ messages in thread
From: 徐迪 @ 2012-02-20 11:39 UTC (permalink / raw)
To: Thomas Rast; +Cc: Philip Oakley, Git List
2012/2/20 Thomas Rast <trast@inf.ethz.ch>:
> Philip Oakley <philipoakley@iee.org> writes:
>
>> From: "Thomas Rast" <trast@inf.ethz.ch> Sent: Monday, February 20,
>> 2012 8:29 AM
>>>
>>> The SHA1 is over the decompressed object contents. The file simply
>>> holds a zlib-compressed stream of those contents. (It's pretty much
>>> like gzip without the file header.)
>>>
>>> You can use any bindings to zlib and something that does sha1, e.g. in
>>> python:
>>>
>>> $ cd g/.git/objects/aa/ # my git.git
>>> $ ls
>>> 592bda986a8380b64acd8cbb3d5bdfcbc0834d
>>> 6322a757bee31919f54edcc127608a3d724c99
>>> $ python
>>> Python 2.7.2 (default, Aug 19 2011, 20:41:43) [GCC] on linux2
>>> Type "help", "copyright", "credits" or "license" for more information.
>>> >>> import hashlib
>>> >>>
>>> hashlib.sha1(open('592bda986a8380b64acd8cbb3d5bdfcbc0834d').read().decode('zlib')).digest().encode('hex')
>>> 'aa592bda986a8380b64acd8cbb3d5bdfcbc0834d'
>>>
>>> Notice that the first byte of the hash goes into the directory name.
>>>
I think Thomas got the point.
> When I tried it from my home directory (not in a git directory):
> $ git cat-file -p Git-Object
> fatal: Not a git repository (or any of the parent directories): .git
This is because git first searches for a git directory; if your current
working directory is not within a git repository, it dies.
I really do not know how you got into that mess. From the link[1] you
gave, I think you just want to clone a repository between computers without
using the network; if so, this[2] will be helpful.
[1]:http://stackoverflow.com/questions/9343260/what-after-git-unpack-objects-to-get-the-actual-file
[2]:http://progit.org/2010/03/10/bundles.html
* Re: [PATCH/RFC v2] Document format of basic Git objects
2012-02-19 8:39 ` Junio C Hamano
2012-02-19 9:14 ` Junio C Hamano
@ 2012-02-20 13:55 ` Nguyen Thai Ngoc Duy
2012-02-20 16:11 ` Jeff King
1 sibling, 1 reply; 17+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-20 13:55 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Jonathan Nieder, Shawn O. Pearce, Scott Chacon
2012/2/19 Junio C Hamano <gitster@pobox.com>:
>> - Do we assume tag/commit header in utf-8 or ascii?
>
> Author-ident is typically utf-8 already, so you cannot assume "ASCII".
I wonder if anyone puts non utf-8 strings in there, or could we
enforce utf-8 (i.e. validate and reject non utf-8 strings) and accept
encoded word syntax (rfc 2047) with the help of the new
$GIT_IDENT_ENCODING variable. The "accept ..." part can wait until
someone is hit by "utf-8 only" check and steps up.
By the same reasoning, maybe we should declare tag content is utf-8
only, until someone needs and adds "encoding" support for it.
>> +The filename may be an arbitrary nonempty string of bytes, as long as
>> +it contains no '/' or NUL character.
>
> s/, as long as it contains no/; it cannot contain any/
The pathname also cannot be "." or "..", I suppose.
Since we also support Windows, should '\\' be banned too? ... probably
not worth it.
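
Put together, the check on a tree entry name would look something like the sketch below. The function name is hypothetical, and the "." / ".." rule is only my supposition from above, not something the current text states:

```python
def is_valid_tree_entry_name(name: bytes) -> bool:
    """Check a tree entry name against the constraints discussed here."""
    if not name:
        return False  # must be nonempty
    if b"/" in name or b"\0" in name:
        return False  # '/' separates path components, NUL terminates the name
    if name in (b".", b".."):
        return False  # assumption: not allowed as entry names
    # A backslash is deliberately accepted: banning it for the sake of
    # Windows is probably not worth it.
    return True
```

Names are treated as bytes, not text, since the format puts no encoding requirement on them.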
>> +The header must not contain NUL.
>
> I vaguely recall that you made sure neither the header nor the body
> contains NUL.
One of the purposes of this document is to note all constraints and
limitations (another is to serve as a reference for users who want to dig
deep into git's data structures without looking at the code). The problem
with handling NUL probably only exists in C Git (and maybe libgit2). I'll
change that to "should not contain NUL".
--
Duy
* Re: [PATCH/RFC v2] Document format of basic Git objects
2012-02-20 13:55 ` Nguyen Thai Ngoc Duy
@ 2012-02-20 16:11 ` Jeff King
0 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2012-02-20 16:11 UTC (permalink / raw)
To: Nguyen Thai Ngoc Duy
Cc: Junio C Hamano, git, Jonathan Nieder, Shawn O. Pearce,
Scott Chacon
On Mon, Feb 20, 2012 at 08:55:28PM +0700, Nguyen Thai Ngoc Duy wrote:
> > Author-ident is typically utf-8 already, so you cannot assume "ASCII".
>
> I wonder if anyone puts non utf-8 strings in there, or could we
> enforce utf-8 (i.e. validate and reject non utf-8 strings) and accept
> encoded word syntax (rfc 2047) with the help of the new
> $GIT_IDENT_ENCODING variable. The "accept ..." part can wait until
> someone is hit by "utf-8 only" check and steps up.
I was just having a similar discussion with libgit2 folks, who were
wondering if there would ever be non-utf8 in there. When we call
"reencode_commit_message", it looks like we do the whole object. In
other words, your author name _must_ match any encoding you specify in
the "encoding" header.
I.e., if you do:
# latin1 é
e=`printf '\xe9'`
export GIT_AUTHOR_NAME="P${e}ff King"
git init
git config i18n.commitencoding iso8859-1
touch foo && git add foo &&
git commit --allow-empty -m "more latin1 ${e}ncoding"
both the name and the message should show fine on your utf8 terminal if
you do this:
git config i18n.logoutputencoding utf8
git show
And similarly, we do the right thing in format-patch, both with and
without logoutputencoding set:
$ git format-patch --root --stdout | grep -Ei "^(from|subject):"
From: =?iso8859-1?q?P=E9ff=20King?= <peff@peff.net>
Subject: [PATCH] =?iso8859-1?q?more=20latin1=20=E9ncoding?=
$ git config i18n.logoutputencoding utf8
$ git format-patch --root --stdout | grep -Ei "^(from|subject):"
From: =?utf8?q?P=C3=A9ff=20King?= <peff@peff.net>
Subject: [PATCH] =?utf8?q?more=20latin1=20=C3=A9ncoding?=
(where 0xc3a9 is the utf8 equivalent of latin1 0xe9).
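
The byte-level relationship is easy to check outside of git. A minimal sketch in Python; the helper name reencode is made up for illustration and has nothing to do with git's reencode_commit_message beyond doing the same kind of conversion:

```python
def reencode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode `data` from the `source` encoding and encode it as `target`."""
    return data.decode(source).encode(target)


latin1_name = b"P\xe9ff King"          # what ends up in the object with iso8859-1
utf8_name = reencode(latin1_name, "latin-1")
print(utf8_name)                        # b'P\xc3\xa9ff King'
```

Going the other way, reencode(utf8_name, "utf-8", "latin-1") gives back the original bytes, which is why the "encoding" header has to describe the author name as well as the message.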
So I have no idea if people are using it or not, but it is actually
usable.
-Peff
* Re: Manually decoding a git object
2012-02-20 10:56 ` Thomas Rast
2012-02-20 11:39 ` 徐迪
@ 2012-02-20 18:27 ` Philip Oakley
1 sibling, 0 replies; 17+ messages in thread
From: Philip Oakley @ 2012-02-20 18:27 UTC (permalink / raw)
To: Thomas Rast; +Cc: Git List, 徐迪
From: "Thomas Rast" <trast@inf.ethz.ch> Sent: Monday, February 20, 2012
10:56 AM
> Philip Oakley <philipoakley@iee.org> writes:
>
>> From: "Thomas Rast" <trast@inf.ethz.ch> Sent: Monday, February 20,
>> 2012 8:29 AM
>>>
>>> The SHA1 is over the decompressed object contents. The file simply
>>> holds a zlib-compressed stream of those contents. (It's pretty much
>>> like gzip without the file header.)
>>>
>>> You can use any bindings to zlib and something that does sha1, e.g. in
>>> python:
>>>
>>> $ cd g/.git/objects/aa/ # my git.git
>>> $ ls
>>> 592bda986a8380b64acd8cbb3d5bdfcbc0834d
>>> 6322a757bee31919f54edcc127608a3d724c99
>>> $ python
>>> Python 2.7.2 (default, Aug 19 2011, 20:41:43) [GCC] on linux2
>>> Type "help", "copyright", "credits" or "license" for more information.
>>> >>> import hashlib
>>> >>> hashlib.sha1(open('592bda986a8380b64acd8cbb3d5bdfcbc0834d').read().decode('zlib')).digest().encode('hex')
>>> 'aa592bda986a8380b64acd8cbb3d5bdfcbc0834d'
>>>
>>> Notice that the first byte of the hash goes into the directory name.
>>>
>>
>> At the moment I'm in a Catch 22 situation where I can't make the first
>> step of examining the deflated contents, so I can't do all those next
>> steps to get the sha1 etc.. Have I misunderstood your suggestions?
>
> Huh? The method I showed does not rely on knowing the SHA1. The fact
> that I used it on a properly filed away (by its SHA1) object file is
> immaterial, if perhaps confusing.
>
> I can untangle that python expression for you:
>
> hashlib.sha1(foo).digest() gives the SHA1 digest of the string foo,
> as a (binary) string
> foo.encode('hex') turns foo from (binary) string into its
> hex representation
> open('filename').read() opens the file called filename, and
> returns its whole contents
> foo.decode('zlib') applies the zlib decompressor to foo, and
> returns the resulting data
>
> So that trick works for any file[*], and you can then use its results to
> file it back where it needs to go.
>
>
> [*] that is sufficiently small for Python to hold it in memory, but git
> shares the same problems in that department.
>
I see what you mean now. I'll need to work out how to get Python into
msysgit - the 'minimal' part of msys keeps on biting... That is, I didn't
find Python in the bash shell of the 1.7.8 full install.
I was hopeful that unzip/gunzip would have an option to simply inflate a
raw zlib (file)stream, rather than expecting the normal archive format.
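
For the record, once Python is available, the kind of stream inflater I was hoping for could be sketched with the zlib module alone. The function name and chunk size are arbitrary choices for illustration; it inflates piece by piece, so the whole object never has to sit in memory:

```python
import hashlib
import zlib


def inflate_stream(src, dst, chunk_size: int = 65536) -> str:
    """Inflate the zlib stream read from `src` into `dst`, chunk by chunk.

    Both arguments are binary file objects. Returns the hex SHA-1 of
    the inflated bytes, which for a loose object is its object name.
    """
    inflater = zlib.decompressobj()
    sha = hashlib.sha1()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        data = inflater.decompress(chunk)
        sha.update(data)
        dst.write(data)
    tail = inflater.flush()  # whatever the decompressor still holds
    sha.update(tail)
    dst.write(tail)
    return sha.hexdigest()
```

The input has to be opened in binary mode ('rb'), which matters on Windows; the output then starts with the "<type> <size>" header followed by a NUL and the object body.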
Thread overview: 17+ messages
2012-02-15 13:22 [PATCH/RFC] Document format of basic Git objects Nguyễn Thái Ngọc Duy
2012-02-15 17:31 ` Jonathan Nieder
2012-02-15 19:48 ` Junio C Hamano
2012-02-16 7:12 ` Junio C Hamano
2012-02-19 4:15 ` [PATCH/RFC v2] " Nguyễn Thái Ngọc Duy
2012-02-19 8:39 ` Junio C Hamano
2012-02-19 9:14 ` Junio C Hamano
2012-02-20 13:55 ` Nguyen Thai Ngoc Duy
2012-02-20 16:11 ` Jeff King
2012-02-19 18:07 ` Manually decoding a git object Philip Oakley
2012-02-20 4:45 ` 徐迪
2012-02-20 8:19 ` Philip Oakley
2012-02-20 8:29 ` Thomas Rast
2012-02-20 10:19 ` Philip Oakley
2012-02-20 10:56 ` Thomas Rast
2012-02-20 11:39 ` 徐迪
2012-02-20 18:27 ` Philip Oakley