Git files data formats documentation

git.vger.kernel.org archive mirror

* Git files data formats documentation
@ 2006-08-05  5:39 A Large Angry SCM
  2006-08-05  5:48 ` Jon Smirl
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: A Large Angry SCM @ 2006-08-05  5:39 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 79 bytes --]

This information may be useful for reading and writing the various Git 
files.

[-- Attachment #2: dataformats.txt --]
[-- Type: text/plain, Size: 14398 bytes --]

Git files data formats
======================

OBJECTS
-------
# The object ID, or "name", of an object is
#	_sha-1_digest_( <OBJECT_HEADER> <object_CONTENTS> ).

<BLOB>
	:	_deflate_( <OBJECT_HEADER> <BLOB_CONTENTS> )
	|	<COMPACT_OBJECT_HEADER> _deflate_( <BLOB_CONTENTS> )
	;

<BLOB_CONTENTS>
	:	<DATA>
	;

<TREE>
	:	_deflate_( <OBJECT_HEADER> <TREE_CONTENTS> )
	|	<COMPACT_OBJECT_HEADER> _deflate_( <TREE_CONTENTS> )
	;

<TREE_CONTENTS>
	:	<TREE_ENTRIES>
	;

<TREE_ENTRIES>
	# Tree entries are sorted by the byte sequence that comprises
	# the entry name.
	:	( <TREE_ENTRY> )*
	;

<TREE_ENTRY>
	# The type of the object referenced MUST be appropriate for
	# the mode. Regular files and symbolic links reference a BLOB
	# and directories reference a TREE.
	:	<OCTAL_MODE> <SP> <NAME> <NUL> <BINARY_OBJ_ID>
	;

<COMMIT>
	:	_deflate_( <OBJECT_HEADER> <COMMIT_CONTENTS> )
	|	<COMPACT_OBJECT_HEADER> _deflate_( <COMMIT_CONTENTS> )
	;

<COMMIT_CONTENTS>
	:	"tree" <SP> <HEX_OBJ_ID> <LF>
		( "parent" <SP> <HEX_OBJ_ID> <LF> )*
		"author" <SP>
			<SAFE_NAME> <SP>
			<LT> <SAFE_EMAIL> <GT> <SP>
			<GIT_DATE> <LF>
		"committer" <SP>
			<SAFE_NAME> <SP>
			<LT> <SAFE_EMAIL> <GT> <SP>
			<GIT_DATE> <LF>
		<LF>
		<DATA>
	;

<TAG>
	:	_deflate_( <OBJECT_HEADER> <TAG_CONTENTS> )
	|	<COMPACT_OBJECT_HEADER> _deflate_( <TAG_CONTENTS> )
	;

<TAG_CONTENTS>
	:	"object" <SP> <HEX_OBJ_ID> <LF>
		"type" <SP> <NONTAG_OBJ_TYPE> <LF>
		"tag" <SP> <TAG_NAME> <LF>
		<LF>
		<DATA>
	;

<OBJECT_HEADER>
	:	<OBJ_TYPE> <SP> <DECIMAL_LENGTH> <NUL>
	;
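
As a rough illustration of the <OBJECT_HEADER> rule above, the following C sketch formats the header bytes that precede the object contents. The function name format_object_header and its buffer-size convention are assumptions made for this example, not Git's own code; the lowercase type string used in the usage note follows the correction made later in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Writes "<type> <decimal length>" followed by a NUL into out.
 * Returns the number of bytes written including the NUL, or 0 if
 * out is too small. (Hypothetical helper, for illustration only.)
 */
size_t format_object_header(char *out, size_t outsz,
			    const char *type, size_t content_len)
{
	int n;

	assert(out != NULL && type != NULL);
	n = snprintf(out, outsz, "%s %zu", type, content_len);
	if (n < 0 || (size_t)n >= outsz)
		return 0;
	return (size_t)n + 1;	/* snprintf already stored the NUL */
}
```

For example, format_object_header(buf, sizeof buf, "blob", 12) stores the 7 characters "blob 12" followed by a NUL and returns 8. The object ID would then be the SHA-1 digest of those 8 bytes followed by the 12 content bytes.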

<COMPACT_OBJECT_HEADER>
	# The object type DELTA_ENCODED is not valid in a
	# <COMPACT_OBJECT_HEADER>.
	:	<TYPE_AND_BASE128_SIZE>
	;

<DATA>
	# Uninterpreted sequence of bytes.
	;

<OCTAL_MODE>
	# Octal encoding, without prefix, of the file system object
	# type and permission bits. The bit layout is according to the
	# POSIX standard, with only regular files, directories, and
	# symbolic links permitted. The actual permission bits are
	# all zero except for regular files. The only permission bit
	# of any consequence to Git is the owner executable bit. By
	# default, the permission bits for files will be either 0644
	# or 0755, depending on the owner executable bit.
	;

<NAME>
	# Sequence of bytes not containing the ASCII character byte
	# value NUL (0x00).
	;

<BINARY_OBJ_ID>
	# The object ID of the referenced object.
	;

<HEX_OBJ_ID>
	# Hexadecimal encoding (lower case) of the <BINARY_OBJ_ID>.
	;

<SAFE_NAME>
	:	<SAFE_STRING> 
	;

<SAFE_EMAIL>
	:	<SAFE_STRING>
	;

<SAFE_STRING>
	# A sequence of bytes not containing the ASCII character byte
	# values NUL (0x00), LF (0x0a), '<' (0x3c), or '>' (0x3e).
	#
	# The sequence may not begin or end with any bytes with the
	# following ASCII character byte values: SPACE (0x20),
	# '.' (0x2e), ',' (0x2c), ':' (0x3a), ';' (0x3b), '<' (0x3c),
	# '>' (0x3e), '"' (0x22), "'" (0x27).
	;

<GIT_DATE>
	:	<SECONDS> <SP> <TZ_OFFSET>
	;

<SECONDS>
	# Base 10, ASCII encoding of the number of seconds since
	# 00:00:00 UTC on January 1, 1970, without accounting for leap
	# seconds, and without leading zeros.
	;

<TZ_OFFSET>
	# Signed offset of time zone from UTC.
	:	<TZ_OFFSET_SIGN> <TZ_OFFSET_HOURS> <TZ_OFFSET_MIN>
	;

<TZ_OFFSET_SIGN>
	:	"+"
	|	"-"
	;

<TZ_OFFSET_HOURS>
	:	<DIGIT> <DIGIT>
	;

<TZ_OFFSET_MIN>
	# Valid values are "00" to "59" inclusive.
	:	<DIGIT> <DIGIT>
	;
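
The <TZ_OFFSET> rules above can be sketched as a small parser. This is only an illustration: the name parse_tz_offset and the choice to return the offset in signed minutes are assumptions, not something the format itself prescribes.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Parses a NUL-terminated "+HHMM" or "-HHMM" prefix. On success
 * stores the signed offset from UTC in minutes and returns 0;
 * returns -1 if the first five characters are not a valid offset.
 */
int parse_tz_offset(const char *s, int *minutes)
{
	int i, hours, mins;

	assert(s != NULL && minutes != NULL);
	if (s[0] != '+' && s[0] != '-')
		return -1;
	for (i = 1; i <= 4; i++)
		if (!isdigit((unsigned char)s[i]))
			return -1;	/* also stops at an early NUL */
	hours = (s[1] - '0') * 10 + (s[2] - '0');
	mins = (s[3] - '0') * 10 + (s[4] - '0');
	if (mins > 59)
		return -1;	/* <TZ_OFFSET_MIN> is "00" to "59" */
	*minutes = (s[0] == '-' ? -1 : 1) * (hours * 60 + mins);
	return 0;
}
```

With this sketch, "+0530" yields 330 and "-0800" yields -480, while "+0060" is rejected.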

<DIGIT>
	# ASCII decimal digit.
	;

<NONTAG_OBJ_TYPE>
	:	"BLOB"
	|	"TREE"
	|	"COMMIT"
	;

<OBJ_TYPE>
	:	<NONTAG_OBJ_TYPE>
	|	"TAG"
	;

<DECIMAL_LENGTH>
	# Base 10, ASCII encoding of the byte length of the object
	# contents, without leading zeros. The length value does not
	# include the length of the <OBJECT_HEADER>.
	:	( <DIGIT> )+
	;

<SP>
	# ASCII SPACE (0x20) character.
	;

<NUL>
	# ASCII NUL (0x00) character.
	;

<LF>
	# ASCII LF (0x0a) "line feed" character.
	;


PACK FILE
---------
# The name of a pack file is "pack-${PACK_ID}.pack", where ${PACK_ID}
# is the hexadecimal encoding (lower case) of the SHA-1 digest of the
# sorted list of binary object IDs in the pack file without a separator
# between the object IDs. Initially, the ${PACK_ID} for a pack was not
# defined, making the value effectively random.

<PACK_FILE>
	:	<PACK_FILE_CONTENTS> <PACK_FILE_CHECKSUM>
	;

<PACK_FILE_CONTENTS>
	:	"PACK" <PACK_VERSION> <PACK_OBJECT_COUNT>
		( <PACKED_OBJECT_HEADER> <PACKED_OBJECT_DATA> )*
	;

<PACK_VERSION>
	# 32 bit, network byte order, binary integer indicating which
	# version of the pack file format was used to create the pack
	# file.
	;

<PACK_OBJECT_COUNT>
	# 32 bit, network byte order, binary integer containing the
	# number of objects encoded in the pack file.
	;
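
A minimal sketch of reading the 12 byte pack file header described by the rules above ("PACK", then two 32 bit network byte order integers). The function name parse_pack_header and its return convention are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads a 32 bit, network byte order (big-endian) integer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Returns 0 and fills in version and count if buf starts with a
 * well-formed pack header, -1 otherwise.
 */
int parse_pack_header(const unsigned char *buf, size_t len,
		      uint32_t *version, uint32_t *count)
{
	assert(buf != NULL && version != NULL && count != NULL);
	if (len < 12 || memcmp(buf, "PACK", 4) != 0)
		return -1;
	*version = read_be32(buf + 4);
	*count = read_be32(buf + 8);
	return 0;
}
```

A caller would normally check the returned version against the versions it understands before walking the packed objects.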

<PACK_FILE_CHECKSUM>
	:	_sha-1_digest_( <PACK_FILE_CONTENTS> )
	;


<PACKED_OBJECT_HEADER>
	# If the object type is not a DELTA_ENCODED object, the packed
	# object data that follows is the deflated byte sequence of the
	# object without the Git object header. The length value is the
	# byte count of the inflated byte sequence of the object.
	#
	# If the object type is a DELTA_ENCODED object, what follows is
	# the ID of the base object and the deflated delta data to
	# transform the base object into the target object. The type of
	# the target object is the same as that of the base object and
	# the length value is the byte count of the inflated delta
	# data. The base object may also be DELTA_ENCODED but cyclic
	# base object chains are not permitted and the pack file MUST
	# contain all base objects.
	:	<TYPE_AND_BASE128_SIZE>
	;

<TYPE_AND_BASE128_SIZE>
	# A compact, variable length, encoding of the packed object
	# length and type. The first byte is comprised of 3 fields
	# (where bit 0 is the least significant bit in a byte):
	#	bit 7:		more flag
	#	bits 6-4:	object type
	#	bits 3-0:	least significant bits of the object
	#			length.
	# If the more flag is set, the next byte contains more object
	# length bits.
	# The object types corresponding to the object type bits are:
	#	6 5 4
	#	- - -
	#	0 0 0	invalid: Reserved
	#	0 0 1	COMMIT object
	#	0 1 0	TREE object
	#	0 1 1	BLOB object
	#	1 0 0	TAG object
	#	1 0 1	invalid: Reserved
	#	1 1 0	invalid: Reserved
	#	1 1 1	DELTA_ENCODED object
	#
	# If the more flag was set, the next byte will have more length
	# bits and will be comprised of 2 fields:
	#	bit 7:		more flag
	#	bits 6-0:	7 additional, more significant, bits of
	#			the object length
	# If the more flag is set, the next byte contains more object
	# length bits using the same encoding.
	;
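
The <TYPE_AND_BASE128_SIZE> encoding above can be decoded along these lines. This is a sketch under stated assumptions: the name decode_type_and_size, the 64 bit size type, and the "return 0 on error" convention are choices made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decodes the object type (bits 6-4 of the first byte) and the
 * variable length size. Returns the number of bytes consumed, or
 * 0 if the input is truncated or the size would not fit 64 bits.
 */
size_t decode_type_and_size(const unsigned char *buf, size_t len,
			    unsigned *type, uint64_t *size)
{
	size_t used = 0;
	unsigned shift = 4;	/* first byte carries 4 size bits */
	unsigned char c;

	assert(buf != NULL && type != NULL && size != NULL);
	if (len == 0)
		return 0;
	c = buf[used++];
	*type = (c >> 4) & 0x07;
	*size = c & 0x0f;
	while (c & 0x80) {	/* more flag set */
		if (used == len || shift > 57)
			return 0;
		c = buf[used++];
		*size |= (uint64_t)(c & 0x7f) << shift;
		shift += 7;
	}
	return used;
}
```

For the two bytes 0x95 0x0a this gives type 1 (COMMIT) and size 5 + (10 << 4) = 165.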

<PACKED_OBJECT_DATA>
	:	_deflate_( <DATA> )
	|	<BINARY_OBJ_ID> _deflate_( <DELTA_DATA> )
	;

<DELTA_DATA>
	# Size of the base object encoded as a base 128 number, least
	# significant bits first, using bit 7 (the most significant
	# bit) of each byte to indicate that more bits follow.
	#
	# Size of the result object encoded as a base 128 number, using
	# the same method as used for the base object size.
	#
	# There will then be a sequence of delta hunks.
	# Zero as the value of the first byte of a hunk is reserved.
	#
	# If bit 7 of the first byte of a delta hunk is not set, the
	# hunk is an "insert" hunk and bits 0-6 specify the number of
	# bytes to append to the output buffer from the hunk.
	#
	# If bit 7 of the first byte of a delta hunk is set, the hunk
	# is a "copy" hunk and bits 0-6 specify how the remaining
	# bytes in the hunk make up the base offset and length for the
	# copy. The following C code demonstrates how to determine the
	# base offset and length for the copy:
	#
	#	/* -  -  -  -  -  -  -  -  -  -  -  - *\
	#	 | This reflects version 3 pack files |
	#	\* -  -  -  -  -  -  -  -  -  -  -  - */
	#
	#	const unsigned char *data = delta_hunk_start;
	#	unsigned char opcode = *data++;
	#	off_t copy_offset = 0;
	#	size_t copy_length = 0;
	#	int i, shift;
	#
	#	for (shift=i=0; i<4; i++) {
	#		if (opcode & 0x01) {
	#			copy_offset |= (off_t)(*data++)<<shift;
	#			}
	#		opcode >>= 1;
	#		shift += 8;
	#		}
	#
	#	for (shift=i=0; i<3; i++) {
	#		if (opcode & 0x01) {
	#			copy_length |= (size_t)(*data++)<<shift;
	#			}
	#		opcode >>= 1;
	#		shift += 8;
	#		}
	#
	#	if (!copy_length) {
	#		copy_length = 1<<16;
	#		}
	#
	# For version 2 pack files, the size of a copy is limited to
	# 64K bytes or less and bit 6 of the opcode byte is set if the
	# source of the copy is from the buffer of the result object
	# instead of the base object.
	#
	# It's unknown if any version 2 pack files were created with
	# bit 6 set in the opcode byte; however, the change that added
	# support for version 3 pack files removed the code that would
	# change the copy source to the result buffer.
	#
	#	/* -  -  -  -  -  -  -  -  -  -  -  - *\
	#	 | This reflects version 2 pack files |
	#	\* -  -  -  -  -  -  -  -  -  -  -  - */
	#
	#	const unsigned char *data = delta_hunk_start;
	#	unsigned char opcode = *data++;
	#	off_t copy_offset = 0;
	#	size_t copy_length = 0;
	#	int i, shift, copy_from_result;
	#
	#	for (shift=i=0; i<4; i++) {
	#		if (opcode & 0x01) {
	#			copy_offset |= (off_t)(*data++)<<shift;
	#			}
	#		opcode >>= 1;
	#		shift += 8;
	#		}
	#
	#	for (shift=i=0; i<2; i++) {
	#		if (opcode & 0x01) {
	#			copy_length |= (size_t)(*data++)<<shift;
	#			}
	#		opcode >>= 1;
	#		shift += 8;
	#		}
	#
	#	if (!copy_length) {
	#		copy_length = 1<<16;
	#		}
	#
	#	copy_from_result = opcode & 0x01;
	#
	;
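
Putting the <DELTA_DATA> description together, a delta could be applied roughly as follows. This is a sketch of the version 3 layout only (no copy-from-result), assuming the delta has already been inflated; apply_delta, read_delta_size, and the error convention are names and choices invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reads a base 128, least-significant-first size; advances *p. */
static size_t read_delta_size(const unsigned char **p,
			      const unsigned char *end)
{
	size_t size = 0;
	unsigned shift = 0;
	unsigned char c = 0;

	do {
		if (*p == end)
			break;
		c = *(*p)++;
		size |= (size_t)(c & 0x7f) << shift;
		shift += 7;
	} while ((c & 0x80) && shift < sizeof(size_t) * 8);
	return size;
}

/* Returns 0 on success, -1 if the delta is malformed. */
int apply_delta(const unsigned char *base, size_t base_len,
		const unsigned char *delta, size_t delta_len,
		unsigned char *out, size_t out_cap, size_t *out_len)
{
	const unsigned char *p = delta, *end = delta + delta_len;
	size_t pos = 0, result_len;

	assert(base && delta && out && out_len);
	if (read_delta_size(&p, end) != base_len)
		return -1;
	result_len = read_delta_size(&p, end);
	if (result_len > out_cap)
		return -1;

	while (p < end) {
		unsigned char opcode = *p++;

		if (opcode & 0x80) {		/* "copy" hunk */
			size_t off = 0, n = 0;
			int i;

			for (i = 0; i < 4; i++)	/* bits 0-3: offset bytes */
				if (opcode & (1 << i)) {
					if (p == end)
						return -1;
					off |= (size_t)*p++ << (8 * i);
				}
			for (i = 0; i < 3; i++)	/* bits 4-6: length bytes */
				if (opcode & (0x10 << i)) {
					if (p == end)
						return -1;
					n |= (size_t)*p++ << (8 * i);
				}
			if (n == 0)
				n = 1 << 16;
			if (off > base_len || n > base_len - off ||
			    n > result_len - pos)
				return -1;
			memcpy(out + pos, base + off, n);
			pos += n;
		} else if (opcode) {		/* "insert" hunk */
			size_t n = opcode;

			if (n > (size_t)(end - p) || n > result_len - pos)
				return -1;
			memcpy(out + pos, p, n);
			p += n;
			pos += n;
		} else {
			return -1;		/* opcode 0 is reserved */
		}
	}
	if (pos != result_len)
		return -1;
	*out_len = pos;
	return 0;
}
```

The bounds checks are deliberately strict so that a malformed delta is rejected instead of reading or writing outside the buffers.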


PACK INDEX
----------
# The name of a pack file index is "pack-${PACK_ID}.idx", where
# ${PACK_ID} is the hexadecimal encoding (lower case) of the SHA-1
# digest of the sorted list of binary object IDs in the pack file
# without a separator between the object IDs. Initially, the ${PACK_ID}
# for a pack was not defined, making the value effectively random.

<PACK_INDEX>
	:	<PACK_INDEX_CONTENTS> <PACK_INDEX_CHECKSUM>
	;

<PACK_INDEX_CONTENTS>
	:	( <INDEX_PARTIAL_COUNT> ){256}
		( <PACK_OBJECT_OFFSET> <BINARY_OBJ_ID> )*
		<PACK_FILE_CHECKSUM>
	;

<INDEX_PARTIAL_COUNT>
	# 32 bit, network byte order, binary integer of the count of
	# objects in the pack file with the first byte of the object
	# ID less than or equal to the index of the count, starting
	# from zero.
	;
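
One way to use the 256 <INDEX_PARTIAL_COUNT> values is to narrow the search for an object ID to the entries that share its first byte. The sketch below assumes the table has been read into memory as raw bytes; fanout_range is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads a 32 bit, network byte order (big-endian) integer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * fanout points at the 256 * 4 bytes at the start of the pack
 * index. On return, entries [*lo, *hi) of the sorted entry table
 * are the ones whose object ID starts with first_byte.
 */
void fanout_range(const unsigned char *fanout, unsigned char first_byte,
		  uint32_t *lo, uint32_t *hi)
{
	assert(fanout != NULL && lo != NULL && hi != NULL);
	*lo = first_byte ?
		read_be32(fanout + 4 * (size_t)(first_byte - 1)) : 0;
	*hi = read_be32(fanout + 4 * (size_t)first_byte);
}
```

The last count (index 255) is therefore the total number of objects in the pack, and a binary search over [*lo, *hi) finds the entry.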

<PACK_OBJECT_OFFSET>
	# 32 bit, network byte order, binary integer giving the offset,
	# in bytes from the beginning of the pack file, where the
	# encoding of the object starts.
	;

<PACK_INDEX_CHECKSUM>
	:	_sha-1_digest_( <PACK_INDEX_CONTENTS> )
	;


INDEX FILE (CACHE)
------------------

<INDEX_FILE>
	:	<INDEX_FILE_FORMAT_V1>
	|	<INDEX_FILE_FORMAT_V2>
	;

<INDEX_FILE_FORMAT_V1>
	# This format is no longer supported.
	:	<INDEX_HEADER> <INDEX_CHECKSUM> <INDEX_CONTENTS>
	;

<INDEX_FILE_FORMAT_V2>
	:	<INDEX_HEADER> <EXTENDED_INDEX_CONTENTS> <EXTENDED_CHECKSUM>
	;

<INDEX_HEADER>
	:	"DIRC" <INDEX_FILE_VERSION> <INDEX_ENTRY_COUNT>
	;

<INDEX_FILE_VERSION>
	# 32 bit, network byte order, binary integer indicating which
	# version of the index file format was used to create the
	# index file.
	;

<INDEX_ENTRY_COUNT>
	# 32 bit, network byte order, binary integer containing the
	# number of index entries in the index file.
	;

<EXTENDED_CHECKSUM>
	:	_sha-1_digest_( <EXTENDED_INDEX_CONTENTS> )
	;

<INDEX_CHECKSUM>
	:	_sha-1_digest_( <INDEX_CONTENTS> )
	;

<INDEX_CONTENTS>
	:	( <INDEX_ENTRY> )*
	;

<EXTENDED_INDEX_CONTENTS>
	:	<INDEX_CONTENTS> <INDEX_CONTENTS_EXTENSIONS>
	;

<INDEX_ENTRY>
	:	<INDEX_ENTRY_STAT_INFO>
		<ENTRY_ID>
		<ENTRY_FLAGS>
		<ENTRY_NAME>
	;

<INDEX_ENTRY_STAT_INFO>
	# These fields are used as a part of a heuristic to determine
	# if the file system entity associated with this entry has
	# changed. The names are very *nix centric but the exact
	# contents of each field have no meaning to Git, besides exact
	# match, except for the <ENTRY_MODE> and <ENTRY_SIZE> fields.
	:	<ENTRY_CTIME>
		<ENTRY_MTIME>
		<ENTRY_DEV>
		<ENTRY_INODE>
		<ENTRY_MODE>
		<ENTRY_UID>
		<ENTRY_GID>
		<ENTRY_SIZE>
	;

<ENTRY_CTIME>
	# The timestamp of the last status change of the associated
	# file system entity.
	:	<ENTRY_TIME>
	;

<ENTRY_MTIME>
	# The timestamp of the last modification of the associated
	# file system entity.
	:	<ENTRY_TIME>
	;

<ENTRY_TIME>
	:	<TIME_LSB32> <TIME_NSEC>
	;

<TIME_LSB32>
	# 32 bit, network byte order, binary integer containing the lower
	# 32 bits of the entry (file or symbolic link) timestamp.
	;

<TIME_NSEC>
	# 32 bit, network byte order, binary integer containing the lower
	# 32 bits of the entry (file or symbolic link) more precise
	# timestamp, if available.
	;

<ENTRY_DEV>
	# 32 bit, network byte order, binary integer containing the lower
	# 32 bits of the entry (file or symbolic link) file system
	# device identifier. Use of this field is a compile time
	# option.
	;

<ENTRY_INODE>
	# 32 bit, network byte order, binary integer containing the lower
	# 32 bits of the entry (file or symbolic link) inode number, or
	# equivalent.
	;

<ENTRY_MODE>
	# 32 bit, network byte order, binary integer containing the lower
	# 32 bits of the entry (file or symbolic link) file system
	# entity type and permissions.
	;

<ENTRY_UID>
	# 32 bit, network byte order, binary integer containing the lower
	# 32 bits of the entry (file or symbolic link) file system
	# entity owner identifier.
	;

<ENTRY_GID>
	# 32 bit, network byte order, binary integer containing the lower
	# 32 bits of the entry (file or symbolic link) file system
	# entity group identifier, or equivalent.
	;

<ENTRY_SIZE>
	# 32 bit, network byte order, binary integer containing the lower
	# 32 bits of the entry (file or symbolic link) size.
	;

<ENTRY_ID>
	# Object ID of the file system entity contents.
	;

<ENTRY_FLAGS>
	# 16 bit, network byte order, binary integer.
	#	bits 15-14	Reserved
	#	bits 13-12	Entry stage
	#	bits 11-0	Name byte length
	#
	# See git-read-tree(1) for a description of how the stage
	# field is used.
	;
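
The bit layout of <ENTRY_FLAGS> above can be unpacked as in the following sketch. The struct entry_flags type and the decode_entry_flags name are assumptions introduced for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decoded view of the 16 bit <ENTRY_FLAGS> field. */
struct entry_flags {
	unsigned stage;		/* bits 13-12 */
	unsigned name_len;	/* bits 11-0  */
};

/* raw points at the two flag bytes, in network byte order. */
struct entry_flags decode_entry_flags(const unsigned char *raw)
{
	struct entry_flags f;
	unsigned v;

	assert(raw != NULL);
	v = ((unsigned)raw[0] << 8) | raw[1];
	f.stage = (v >> 12) & 0x3;
	f.name_len = v & 0x0fff;
	return f;
}
```

For the bytes 0x20 0x05 this gives stage 2 and a name length of 5 bytes.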

<ENTRY_NAME>
	# File system entity name. Path is normalized and relative to
	# the working directory.
	;

<INDEX_CONTENTS_EXTENSIONS>
	:	( <INDEX_EXTENSION> )*
	;

<INDEX_EXTENSION>
	:	<INDEX_EXTENSION_HEADER>
		<INDEX_EXTENSION_DATA>
	;

<INDEX_EXTENSION_HEADER>
	:	<INDEX_EXTENSION_NAME> <INDEX_EXTENSION_DATA_SIZE>
	;

<INDEX_EXTENSION_NAME>
	# 4 byte sequence identifying how the <INDEX_EXTENSION_DATA>
	# should be interpreted. The first byte having a value greater
	# than or equal to the ASCII character 'A' (0x41) and less than
	# or equal to the ASCII character 'Z' (0x5a).
	;

<INDEX_EXTENSION_DATA_SIZE>
	# 32 bit, network byte order, binary integer containing the
	# length of the <INDEX_EXTENSION_DATA> byte sequence.
	;

<INDEX_EXTENSION_DATA>
	# Sequence of bytes.
	;




* Re: Git files data formats documentation
  2006-08-05  5:39 Git files data formats documentation A Large Angry SCM
@ 2006-08-05  5:48 ` Jon Smirl
  2006-08-05  6:51 ` Junio C Hamano
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Jon Smirl @ 2006-08-05  5:48 UTC (permalink / raw)
  To: gitzilla; +Cc: git

You might make some notes about old format headers and new format ones
and the use_legacy_headers flag.

I started off looking at packs so I knew about TYPE_AND_BASE128_SIZE.
Next I wanted to write objects, so I looked at sha1_file.c. If you
don't look at the code closely, write_binary_header() will lead you to
believe that object files use TYPE_AND_BASE128_SIZE. It took me a
couple of hours to notice use_legacy_headers and discover that it
defaults to on.

Jon Smirl
jonsmirl@gmail.com


* Re: Git files data formats documentation
  2006-08-05  5:39 Git files data formats documentation A Large Angry SCM
  2006-08-05  5:48 ` Jon Smirl
@ 2006-08-05  6:51 ` Junio C Hamano
  2006-08-05 19:30   ` Shawn Pearce
  2006-08-05 21:56   ` A Large Angry SCM
  2006-08-05 16:22 ` Shawn Pearce
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 13+ messages in thread
From: Junio C Hamano @ 2006-08-05  6:51 UTC (permalink / raw)
  To: A Large Angry SCM; +Cc: git

Good documentation, but a few nitpicks need addressing before it hits
the Documentation/technical/ part of the source tree.

> <TREE_ENTRIES>
> 	# Tree entries are sorted by the byte sequence that comprises
> 	# the entry name.
> 	:	( <TREE_ENTRY> )*
> 	;

Not quite.  An entry for a subtree is sorted as if a '/' is
suffixed to its name.

        $ git ls-tree $T
        100644 blob 2398e9f8892812607f5eee6ed0d5712c2e3de197	a-
        100644 blob 7f07527a80bd8c2b1c5087d7ccfe61073b068374	a-b
        040000 tree 23fddf6a57ff3ba98aa93fb71431276c3f1a3c40	a
        100644 blob 2afe6dcc5466068b8dcc7263cece05d2adf044fe	a=
        100644 blob efc73add7dd868242a66faf2a59b145f2a60b834	a=b

This is, by the way, consistent with the order of cache entries
in the index file.

        $ git ls-files -s
        100644 2398e9f8892812607f5eee6ed0d5712c2e3de197 0	a-
        100644 7f07527a80bd8c2b1c5087d7ccfe61073b068374 0	a-b
        100644 0ee729686ab2a0074639c5f64930648571e7c4b2 0	a/b
        100644 2afe6dcc5466068b8dcc7263cece05d2adf044fe 0	a=
        100644 efc73add7dd868242a66faf2a59b145f2a60b834 0	a=b
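
The ordering shown in the two listings above could be expressed as a comparison function along these lines. This is only a sketch of the rule as described here (a subtree sorts as if '/' were appended to its name); struct tree_entry, key_byte_at, and tree_entry_cmp are names made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory view of one tree entry. */
struct tree_entry {
	const char *name;	/* need not be NUL-terminated */
	size_t name_len;
	int is_tree;		/* non-zero if the entry references a TREE */
};

/* Byte found at position pos of the entry's sort key. */
static unsigned char key_byte_at(const struct tree_entry *e, size_t pos)
{
	if (pos < e->name_len)
		return (unsigned char)e->name[pos];
	return e->is_tree ? '/' : '\0';	/* implicit suffix */
}

/* Returns <0, 0 or >0 like memcmp(). */
int tree_entry_cmp(const struct tree_entry *a, const struct tree_entry *b)
{
	size_t min;
	int cmp;

	assert(a != NULL && b != NULL);
	min = a->name_len < b->name_len ? a->name_len : b->name_len;
	cmp = memcmp(a->name, b->name, min);
	if (cmp)
		return cmp;
	return (int)key_byte_at(a, min) - (int)key_byte_at(b, min);
}
```

With this comparison the tree "a" sorts after the blobs "a-" and "a-b" ('-' is 0x2d, '/' is 0x2f) and before "a=" (0x3d), matching the ls-tree output above.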

> <TREE_ENTRY>
> 	# The type of the object referenced MUST be appropriate for
> 	# the mode. Regular files and symbolic links reference a BLOB
> 	# and directories reference a TREE.
> 	:	<OCTAL_MODE> <SP> <NAME> <NUL> <BINARY_OBJ_ID>
> 	;

As you correctly explain later, OCTAL_MODE must be minimal; "git
ls-tree" output says 040000 in the above example, but the actual
object records it as 40000.
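
To illustrate the "minimal" octal form: printing the mode with a plain "%o" conversion gives the digits without a leading zero. The helper name format_tree_mode is an assumption for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Writes the octal digits of mode, without prefix or padding.
 * Returns the number of characters written (excluding the NUL),
 * or 0 if out is too small.
 */
size_t format_tree_mode(char *out, size_t outsz, unsigned mode)
{
	int n;

	assert(out != NULL);
	n = snprintf(out, outsz, "%o", mode);
	if (n < 0 || (size_t)n >= outsz)
		return 0;
	return (size_t)n;
}
```

A directory mode of 040000 comes out as the five characters "40000", and a regular file as "100644".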

> <TAG_CONTENTS>
> 	:	"object" <SP> <HEX_OBJ_ID> <LF>
> 		"type" <SP> <NONTAG_OBJ_TYPE> <LF>
> 		"tag" <SP> <TAG_NAME> <LF>
> 		<LF>
> 		<DATA>
> 	;

A tag can tag another tag (think of chain of trust), so what
follows "type" does not have to be NONTAG_OBJ_TYPE.

> <OCTAL_MODE>
> 	# Octal encoding, without prefix, of the file system object
> 	# type and permission bits. The bit layout is according to the
> 	# POSIX standard, with only regular files, directories, and
> 	# symbolic links permitted. The actual permission bits are
> 	# all zero except for regular files. The only permission bit
> 	# of any consequence to Git is the owner executable bit. By
> 	# default, the permission bits for files will be either 0644
> 	# or 0755, depending on the owner executable bit.
> 	;

It's not really "by default" -- more like "by definition", since
there is no way for the program to use something different.  We
used to record non-canonical modes in ancient versions of git,
but I think fsck-objects would warn on objects created that way.

> <NONTAG_OBJ_TYPE>
> 	:	"BLOB"
> 	|	"TREE"
> 	|	"COMMIT"
> 	;

Drop this definition, and make the literals part of <OBJ_TYPE>,
after lowercasing them ;-).

> <OBJ_TYPE>
> 	:	<NONTAG_OBJ_TYPE>
> 	|	"TAG"
> 	;

> PACK FILE
> ---------
> # The name of a pack file is "pack-${PACK_ID}.pack", where ${PACK_ID}
> # is the hexadecimal encoding (lower case) of the SHA-1 digest of the
> # sorted list of binary object IDs in the pack file without a separator
> # between the object IDs. Initially, the ${PACK_ID} for a pack was not
> # defined, making the value effectively random.

Although the core level does not really care, a PACK_ID is
required to be a unique (within an object store and its alternates)
40-character hexadecimal string for the http commit walker to work
properly.

BTW, I still have a patch to tighten the check to enforce this
as part of the consistency check.

> <PACKED_OBJECT_DATA>
> 	:	_deflate_( <DATA> )
> 	|	<BINARY_OBJ_ID> _deflate_( <DELTA_DATA> )
> 	;

It might be cleaner to separate this definition into two.  That
is, one packed object is either non-delta-type base128 type-length
followed by deflated data, or delta-type base128 type-length
followed by base object id followed by deflated delta.

> PACK INDEX
> ----------
> # The name of a pack file index is "pack-${PACK_ID}.idx", where
> # ${PACK_ID} is the hexadecimal encoding (lower case) of the SHA-1
> # digest of the sorted list of binary object IDs in the pack file
> # without a separator between the object IDs. Initially, the ${PACK_ID}
> # for a pack was not defined, making the value effectively random.

I would not repeat the ", where ${PACK_ID} is..." part, which was
already covered in the description of the pack file.  Rather,
", where ${PACK_ID} is the same as that of the .pack file the .idx
file corresponds to" would be more appropriate.

> <INDEX_PARTIAL_COUNT>
> 	# 32 bit, network byte order, binary integer of the count of
> 	# objects in the pack file with the first byte of the object
> 	# ID less than or equal to the index of the count, starting
> 	# from zero.
> 	;

Linus and I call this part "fan-out".

> <ENTRY_NAME>
> 	# File system entity name. Path is normalized and relative to
> 	# the working directory.
> 	;

Did you mention that the index entries are sorted by name?

> <INDEX_EXTENSION_NAME>
> 	# 4 byte sequence identifying how the <INDEX_EXTENSION_DATA>
> 	# should be interpreted. The first byte having a value greater
> 	# than or equal to the ASCII character 'A' (0x41) and less than
> 	# or equal to the ASCII character 'Z' (0x5a).
> 	;

This is not true, but the code needs better comments.  The
intention is that an extended section whose name starts with a
capital letter (such as the "cache-tree extension", whose name is
"TREE") is purely optional; if a different version of the
software does not understand it, it can still safely keep using
the rest of the index.  If somebody introduces a new extended
section that _must_ be interpreted in order to fully understand
what the index file records, such an extended section can signal
that by having a name that does not start with a capital.  A
version of the software that does understand such extended
sections would have a case arm that covers such a name in the
switch statement you took this 'A' .. 'Z' from.
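
The convention described above might be checked as in this sketch. The function name index_extension_is_optional is an assumption; "TREE" is the cache-tree extension mentioned above, and any lowercase name used with it here is purely hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns 1 if an index extension with this 4 byte name may be
 * skipped by software that does not understand it (first byte is
 * 'A'..'Z'), 0 if it must be understood to use the index.
 */
int index_extension_is_optional(const unsigned char *name)
{
	assert(name != NULL);
	return name[0] >= 'A' && name[0] <= 'Z';
}
```

A reader that meets an unknown extension would skip its data when this returns 1 and refuse to use the index otherwise.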


* Re: Git files data formats documentation
  2006-08-05  6:51 ` Junio C Hamano
@ 2006-08-05 19:30   ` Shawn Pearce
  2006-08-05 21:56   ` A Large Angry SCM
  1 sibling, 0 replies; 13+ messages in thread
From: Shawn Pearce @ 2006-08-05 19:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: A Large Angry SCM, git

Junio C Hamano <junkio@cox.net> wrote:
> Not quite.  An entry for a subtree is sorted as if a '/' is
> suffixed to its name.
> 
>         $ git ls-tree $T
>         100644 blob 2398e9f8892812607f5eee6ed0d5712c2e3de197	a-
>         100644 blob 7f07527a80bd8c2b1c5087d7ccfe61073b068374	a-b
>         040000 tree 23fddf6a57ff3ba98aa93fb71431276c3f1a3c40	a
>         100644 blob 2afe6dcc5466068b8dcc7263cece05d2adf044fe	a=
>         100644 blob efc73add7dd868242a66faf2a59b145f2a60b834	a=b
> 
> This is, by the way, consistent with the order of cache entries
> in the index file.

Arrrrgh.  I didn't realize that '/' was needed on the end of a tree
name when sorting its parent for output.  jgit was/is definitely
doing this wrong.  And it all comes back to how the index operates,
doesn't it?  :-)

I've got to go back now and do some surgery on how jgit sorts
entries in a tree.  Clearly it would be incorrect with the example
you just gave.  It also would have thought that core GIT generated
a corrupt tree if it tried to read in your example.  Thank you for
taking the time to clarify it!

-- 
Shawn.


* Re: Git files data formats documentation
  2006-08-05  6:51 ` Junio C Hamano
  2006-08-05 19:30   ` Shawn Pearce
@ 2006-08-05 21:56   ` A Large Angry SCM
  2006-08-06  0:37     ` Junio C Hamano
  1 sibling, 1 reply; 13+ messages in thread
From: A Large Angry SCM @ 2006-08-05 21:56 UTC (permalink / raw)
  To: Junio C Hamano, git

Junio C Hamano wrote:
...
>> <OCTAL_MODE>
>> 	# Octal encoding, without prefix, of the file system object
>> 	# type and permission bits. The bit layout is according to the
>> 	# POSIX standard, with only regular files, directories, and
>> 	# symbolic links permitted. The actual permission bits are
>> 	# all zero except for regular files. The only permission bit
>> 	# of any consequence to Git is the owner executable bit. By
>> 	# default, the permission bits for files will be either 0644
>> 	# or 0755, depending on the owner executable bit.
>> 	;
> 
> It's not really "by default" -- more like "by definition", since
> there is no way for the program to use something different.  We
> used to record non-canonical modes in ancient versions of git,
> but I think fsck-objects would warn on objects created that way.
> 

See git-mktree.


* Re: Git files data formats documentation
  2006-08-05 21:56   ` A Large Angry SCM
@ 2006-08-06  0:37     ` Junio C Hamano
  0 siblings, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2006-08-06  0:37 UTC (permalink / raw)
  To: gitzilla; +Cc: git

A Large Angry SCM <gitzilla@gmail.com> writes:

> Junio C Hamano wrote:
> ...
>>> <OCTAL_MODE>
>>> 	# Octal encoding, without prefix, of the file system object
>>> 	# type and permission bits. The bit layout is according to the
>>> 	# POSIX standard, with only regular files, directories, and
>>> 	# symbolic links permitted. The actual permission bits are
>>> 	# all zero except for regular files. The only permission bit
>>> 	# of any consequence to Git is the owner executable bit. By
>>> 	# default, the permission bits for files will be either 0644
>>> 	# or 0755, depending on the owner executable bit.
>>> 	;
>>
>> It's not really "by default" -- more like "by definition", since
>> there is no way for the program to use something different.  We
>> used to record non-canonical modes in ancient versions of git,
>> but I think fsck-objects would warn on objects created that way.
>>
>
> See git-mktree.

That's a bad example -- the tool being too loose.


* Re: Git files data formats documentation
  2006-08-05  5:39 Git files data formats documentation A Large Angry SCM
  2006-08-05  5:48 ` Jon Smirl
  2006-08-05  6:51 ` Junio C Hamano
@ 2006-08-05 16:22 ` Shawn Pearce
  2006-08-05 17:31   ` A Large Angry SCM
  2006-08-05 18:41 ` Jakub Narebski
  2006-08-05 22:35 ` A Large Angry SCM
  4 siblings, 1 reply; 13+ messages in thread
From: Shawn Pearce @ 2006-08-05 16:22 UTC (permalink / raw)
  To: A Large Angry SCM; +Cc: git

A Large Angry SCM <gitzilla@gmail.com> wrote:
> This information may be useful for reading and writing the various Git 
> files.
[snip]
> 	#	/* -  -  -  -  -  -  -  -  -  -  -  - *\
> 	#	 | This reflects version 3 pack files |
> 	#	\* -  -  -  -  -  -  -  -  -  -  -  - */
[snip]
> 	#	/* -  -  -  -  -  -  -  -  -  -  -  - *\
> 	#	 | This reflects version 2 pack files |
> 	#	\* -  -  -  -  -  -  -  -  -  -  -  - */

Thanks for taking the time to write these out.  The pack delta
formats were particularly helpful as it caused me to go back
and look at the unpacking code in jgit.

Apparently I wasn't handling the version 2 pack file correctly as I
didn't support copy-from-result; I had an infinite loop if the base
didn't decompress in one read (never happens right now, but could
in the future); and apparently my insert opcode implementation was
causing an infinite loop.  Nasty bugs.  I need to get more unit
tests written apparently.  :-)

-- 
Shawn.


* Re: Git files data formats documentation
  2006-08-05 16:22 ` Shawn Pearce
@ 2006-08-05 17:31   ` A Large Angry SCM
  0 siblings, 0 replies; 13+ messages in thread
From: A Large Angry SCM @ 2006-08-05 17:31 UTC (permalink / raw)
  To: Shawn Pearce, git

Shawn Pearce wrote:
> A Large Angry SCM <gitzilla@gmail.com> wrote:
>> This information may be useful for reading and writing the various Git 
>> files.
> [snip]
>> 	#	/* -  -  -  -  -  -  -  -  -  -  -  - *\
>> 	#	 | This reflects version 3 pack files |
>> 	#	\* -  -  -  -  -  -  -  -  -  -  -  - */
> [snip]
>> 	#	/* -  -  -  -  -  -  -  -  -  -  -  - *\
>> 	#	 | This reflects version 2 pack files |
>> 	#	\* -  -  -  -  -  -  -  -  -  -  -  - */
> 
> Thanks for taking the time to write these out.  The pack delta
> formats were particularly helpful as it caused me to go back
> and look at the unpacking code in jgit.
> 
> Apparently I wasn't handling the version 2 pack file correctly as I
> didn't support copy-from-result; I had an infinite loop if the base
> didn't decompress in one read (never happens right now, but could
> in the future); and apparently my insert opcode implementation was
> causing an infinite loop.  Nasty bugs.  I need to get more unit
> tests written apparently.  :-)

Keep in mind that the git-core code for reading version 2 or version 3 
pack files does _not_ handle copy-from-result correctly.


* Re: Git files data formats documentation
  2006-08-05  5:39 Git files data formats documentation A Large Angry SCM
                   ` (2 preceding siblings ...)
  2006-08-05 16:22 ` Shawn Pearce
@ 2006-08-05 18:41 ` Jakub Narebski
  2006-08-05 20:15   ` A Large Angry SCM
  2006-08-05 22:35 ` A Large Angry SCM
  4 siblings, 1 reply; 13+ messages in thread
From: Jakub Narebski @ 2006-08-05 18:41 UTC (permalink / raw)
  To: git

A Large Angry SCM wrote:

> <TREE_ENTRY>
>         # The type of the object referenced MUST be appropriate for
>         # the mode. Regular files and symbolic links reference a BLOB
>         # and directories reference a TREE.
>         :       <OCTAL_MODE> <SP> <NAME> <NUL> <BINARY_OBJ_ID>
>         ;
[...]
> <OCTAL_MODE>
>         # Octal encoding, without prefix, of the file system object
>         # type and permission bits. The bit layout is according to the
>         # POSIX standard, with only regular files, directories, and
>         # symbolic links permitted. The actual permission bits are
>         # all zero except for regular files. The only permission bit
>         # of any consequence to Git is the owner executable bit. By
>         # default, the permission bits for files will be either 0644
>         # or 0755, depending on the owner executable bit.
>         ;

I do wonder why there is <OCTAL_MODE> (and not <BINARY_OCTAL_MODE>) 
but <BINARY_OBJ_ID> (and not <HEX_OBJ_ID>).

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git


* Re: Git files data formats documentation
  2006-08-05 18:41 ` Jakub Narebski
@ 2006-08-05 20:15   ` A Large Angry SCM
  2006-08-05 23:43     ` Jakub Narebski
  0 siblings, 1 reply; 13+ messages in thread
From: A Large Angry SCM @ 2006-08-05 20:15 UTC (permalink / raw)
  To: git

Jakub Narebski wrote:
> A Large Angry SCM wrote:
> 
>> <TREE_ENTRY>
>>         # The type of the object referenced MUST be appropriate for
>>         # the mode. Regular files and symbolic links reference a BLOB
>>         # and directories reference a TREE.
>>         :       <OCTAL_MODE> <SP> <NAME> <NUL> <BINARY_OBJ_ID>
>>         ;
> [...]
>> <OCTAL_MODE>
>>         # Octal encoding, without prefix, of the file system object
>>         # type and permission bits. The bit layout is according to the
>>         # POSIX standard, with only regular files, directories, and
>>         # symbolic links permitted. The actual permission bits are
>>         # all zero except for regular files. The only permission bit
>>         # of any consequence to Git is the owner executable bit. By
>>         # default, the permission bits for files will be either 0644
>>         # or 0755, depending on the owner executable bit.
>>         ;
> 
> I do wonder why there is <OCTAL_MODE> (and not <BINARY_OCTAL_MODE>) 
> but <BINARY_OBJ_ID> (and not <HEX_OBJ_ID>).
> 

<OCTAL_MODE> because it's an ASCII string. <BINARY_OBJ_ID> because it's 
the 20 byte digest.


* Re: Git files data formats documentation
  2006-08-05 20:15   ` A Large Angry SCM
@ 2006-08-05 23:43     ` Jakub Narebski
  0 siblings, 0 replies; 13+ messages in thread
From: Jakub Narebski @ 2006-08-05 23:43 UTC (permalink / raw)
  To: git

A Large Angry SCM wrote:

> Jakub Narebski wrote:

>> I do wonder why there is <OCTAL_MODE> (and not <BINARY_OCTAL_MODE>) 
>> but <BINARY_OBJ_ID> (and not <HEX_OBJ_ID>).
>> 
> 
> <OCTAL_MODE> because it's an ASCII string. <BINARY_OBJ_ID> because it's 
> the 20 byte digest.

I meant: why does Git use an ASCII string for the octal mode, while using
the 20 byte digest for the object ID in the tree format? It would be more
consistent to use binary for both, or ASCII for both (i.e. <HEX_OBJ_ID>).

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git


* Re: Git files data formats documentation
  2006-08-05  5:39 Git files data formats documentation A Large Angry SCM
                   ` (3 preceding siblings ...)
  2006-08-05 18:41 ` Jakub Narebski
@ 2006-08-05 22:35 ` A Large Angry SCM
  2006-08-16 16:55   ` Nicolas Pitre
  4 siblings, 1 reply; 13+ messages in thread
From: A Large Angry SCM @ 2006-08-05 22:35 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 119 bytes --]

A Large Angry SCM wrote:
> This information may be useful for reading and writing the various Git 
> files.

Revised.


[-- Attachment #2: dataformats.txt --]
[-- Type: text/plain, Size: 15198 bytes --]

Git files data formats
======================

OBJECTS
-------
# The object ID, or "name", of an object is
#	_sha-1_digest_( <OBJECT_HEADER> <object_CONTENTS> ).

<BLOB>
	:	_deflate_( <OBJECT_HEADER> <BLOB_CONTENTS> )
	|	<COMPACT_OBJECT_HEADER> _deflate_( <BLOB_CONTENTS> )
	;

<BLOB_CONTENTS>
	:	<DATA>
	;

<TREE>
	:	_deflate_( <OBJECT_HEADER> <TREE_CONTENTS> )
	|	<COMPACT_OBJECT_HEADER> _deflate_( <TREE_CONTENTS> )
	;

<TREE_CONTENTS>
	:	<TREE_ENTRIES>
	;

<TREE_ENTRIES>
	# Tree entries are sorted by the byte sequence that comprises
	# the entry name. However, for the purposes of the sort
	# comparison, entries for tree objects are compared as if the
	# entry name byte sequence has a trailing ASCII '/' (0x2f).
	:	( <TREE_ENTRY> )*
	;
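
# A minimal sketch of the sort comparison described for <TREE_ENTRIES>.
# The function name tree_entry_cmp and its is_tree flags are
# hypothetical; the caller is assumed to know from the mode whether
# each entry references a TREE.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: compare two tree entry names. An entry that
 * references a TREE is compared as if its name ended in '/' (0x2f).
 * Returns <0, 0 or >0, like memcmp().
 */
int tree_entry_cmp(const char *name1, size_t len1, int is_tree1,
                   const char *name2, size_t len2, int is_tree2)
{
    size_t min = len1 < len2 ? len1 : len2;
    int cmp = memcmp(name1, name2, min);
    unsigned char c1, c2;

    if (cmp)
        return cmp;

    /*
     * The common prefix is identical: compare the next byte of each
     * name, using the virtual trailing '/' for a tree (or NUL for
     * anything else) once a name is exhausted.
     */
    c1 = len1 > min ? (unsigned char)name1[min] : (is_tree1 ? '/' : 0);
    c2 = len2 > min ? (unsigned char)name2[min] : (is_tree2 ? '/' : 0);
    return (c1 > c2) - (c1 < c2);
}
```

# For example, a tree named "foo" sorts after a blob named "foo.c"
# ('/' is 0x2f, '.' is 0x2e), whereas a blob named "foo" sorts before
# it.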

<TREE_ENTRY>
	# The type of the object referenced MUST be appropriate for
	# the mode. Regular files and symbolic links reference a BLOB
	# and directories reference a TREE.
	:	<OCTAL_MODE> <SP> <NAME> <NUL> <BINARY_OBJ_ID>
	;

<COMMIT>
	:	_deflate_( <OBJECT_HEADER> <COMMIT_CONTENTS> )
	|	<COMPACT_OBJECT_HEADER> _deflate_( <COMMIT_CONTENTS> )
	;

<COMMIT_CONTENTS>
	:	"tree" <SP> <HEX_OBJ_ID> <LF>
		( "parent" <SP> <HEX_OBJ_ID> <LF> )*
		"author" <SP>
			<SAFE_NAME> <SP>
			<LT> <SAFE_EMAIL> <GT> <SP>
			<GIT_DATE> <LF>
		"committer" <SP>
			<SAFE_NAME> <SP>
			<LT> <SAFE_EMAIL> <GT> <SP>
			<GIT_DATE> <LF>
		<LF>
		<DATA>
	;

<TAG>
	:	_deflate_( <OBJECT_HEADER> <TAG_CONTENTS> )
	|	<COMPACT_OBJECT_HEADER> _deflate_( <TAG_CONTENTS> )
	;

<TAG_CONTENTS>
	:	"object" <SP> <HEX_OBJ_ID> <LF>
		"type" <SP> <OBJ_TYPE> <LF>
		"tag" <SP> <TAG_NAME> <LF>
		<LF>
		<DATA>
	;

<OBJECT_HEADER>
	:	<OBJ_TYPE> <SP> <DECIMAL_LENGTH> <NUL>
	;
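
# A minimal sketch of building an <OBJECT_HEADER>. The function name
# format_object_header and its buffer-based interface are assumptions
# made for illustration only.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write "<type> <length>" followed by a NUL byte
 * into buf. Returns the header length in bytes, including the NUL, or
 * 0 if the header does not fit in bufsize bytes.
 */
size_t format_object_header(char *buf, size_t bufsize,
                            const char *type, unsigned long content_len)
{
    int n = snprintf(buf, bufsize, "%s %lu", type, content_len);

    if (n < 0 || (size_t)n >= bufsize)
        return 0;
    /* snprintf() already stored the terminating NUL, which is the <NUL>. */
    return (size_t)n + 1;
}
```

# A 12 byte blob yields the 8 byte header "blob 12" plus NUL. The
# object ID is then the SHA-1 digest of this header followed by the
# object contents.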

<COMPACT_OBJECT_HEADER>
	# The object type DELTA_ENCODED is not valid in a
	# <COMPACT_OBJECT_HEADER>.
	:	<TYPE_AND_BASE128_SIZE>
	;

<DATA>
	# Uninterpreted sequence of bytes.
	;

<OCTAL_MODE>
	# ASCII encoding of the octal encoding, without prefix or
	# leading zeros, of the file system object type and permission
	# bits. The bit layout is according to the POSIX standard, with
	# only regular files, directories, and symbolic links
	# permitted. The actual permission bits are all zero except for
	# regular files. The only permission bit of any consequence to
	# Git is the owner executable bit. By default, the permission
	# bits for files will be either 0644 or 0755, depending on the
	# owner executable bit.
	;

<NAME>
	# Sequence of bytes not containing the ASCII character byte
	# values NUL (0x00) or "/" (0x2f).
	;

<BINARY_OBJ_ID>
	# The object ID of the referenced object.
	;

<HEX_OBJ_ID>
	# Hexadecimal encoding (lower case) of the <BINARY_OBJ_ID>.
	;

<SAFE_NAME>
	:	<SAFE_STRING> 
	;

<SAFE_EMAIL>
	:	<SAFE_STRING>
	;

<SAFE_STRING>
	# A sequence of bytes not containing the ASCII character byte
	# values NUL (0x00), LF (0x0a), '<' (0x3c), or '>' (0x3e).
	#
	# The sequence may not begin or end with any bytes with the
	# following ASCII character byte values: SPACE (0x20),
	# '.' (0x2e), ',' (0x2c), ':' (0x3a), ';' (0x3b), '<' (0x3c),
	# '>' (0x3e), '"' (0x22), "'" (0x27).
	;

<GIT_DATE>
	:	<SECONDS> <SP> <TZ_OFFSET>
	;

<SECONDS>
	# Base 10, ASCII encoding of the number of seconds since
	# 00:00:00 UTC on January 1, 1970, without accounting for leap
	# seconds, and without leading zeros.
	;

<TZ_OFFSET>
	# Signed offset of time zone from UTC.
	:	<TZ_OFFSET_SIGN> <TZ_OFFSET_HOURS> <TZ_OFFSET_MIN>
	;

<TZ_OFFSET_SIGN>
	:	"+"
	|	"-"
	;

<TZ_OFFSET_HOURS>
	:	<DIGIT> <DIGIT>
	;

<TZ_OFFSET_MIN>
	# Valid values are "00" to "59" inclusive.
	:	<DIGIT> <DIGIT>
	;
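
# A minimal sketch of formatting a <GIT_DATE>. The function name
# format_git_date is hypothetical, and passing the time zone offset as
# a signed number of minutes is an assumption of this example.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write "<SECONDS> <TZ_OFFSET>" into buf.
 * tz_minutes is the signed offset from UTC in minutes.
 * Returns the string length, or -1 if it does not fit.
 */
int format_git_date(char *buf, size_t bufsize,
                    unsigned long seconds, int tz_minutes)
{
    char sign = tz_minutes < 0 ? '-' : '+';
    int abs_min = tz_minutes < 0 ? -tz_minutes : tz_minutes;
    int n = snprintf(buf, bufsize, "%lu %c%02d%02d",
                     seconds, sign, abs_min / 60, abs_min % 60);

    return (n < 0 || (size_t)n >= bufsize) ? -1 : n;
}
```

# With seconds = 1154756340 and an offset of -240 minutes, the result
# is "1154756340 -0400".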

<DIGIT>
	# ASCII decimal digit.
	;

<OBJ_TYPE>
	:	"blob"
	|	"tree"
	|	"commit"
	|	"tag"
	;

<DECIMAL_LENGTH>
	# Base 10, ASCII encoding of the byte length of the object
	# contents, without leading zeros. The length value does not
	# include the length of the <OBJECT_HEADER>.
	:	( <DIGIT> )+
	;

<SP>
	# ASCII SPACE (0x20) character.
	;

<NUL>
	# ASCII NUL (0x00) character.
	;

<LF>
	# ASCII LF (0x0a) "line feed" character.
	;


PACK FILE
---------
# The name of a pack file is "pack-${PACK_ID}.pack", where ${PACK_ID}
# is the hexadecimal encoding (lower case) of the SHA-1 digest of the
# sorted list of binary object IDs in the pack file without a separator
# between the object IDs. Initially, the ${PACK_ID} for a pack was not
# well defined, effectively making the value 40 random hexadecimal
# (lower case) characters. Recent non-git-core code depends on the
# uniqueness of ${PACK_ID} across all of the object databases used by
# a repository.

<PACK_FILE>
	:	<PACK_FILE_CONTENTS> <PACK_FILE_CHECKSUM>
	;

<PACK_FILE_CONTENTS>
	:	"PACK" <PACK_VERSION> <PACK_OBJECT_COUNT>
		( <PACKED_OBJECT_HEADER> <PACKED_OBJECT_DATA> )*
	;

<PACK_VERSION>
	# 32 bit, network byte order, binary integer indicating which
	# version of the pack file format was used to create the pack
	# file.
	;

<PACK_OBJECT_COUNT>
	# 32 bit, network byte order, binary integer containing the
	# number of objects encoded in the pack file.
	;

<PACK_FILE_CHECKSUM>
	:	_sha-1_digest_( <PACK_FILE_CONTENTS> )
	;


<PACKED_OBJECT_HEADER>
	# If the object type is not a DELTA_ENCODED object, the packed
	# object data that follows is the deflated byte sequence of the
	# object without the Git object header. The length value is the
	# byte count of the inflated byte sequence of the object.
	#
	# If the object type is a DELTA_ENCODED object, what follows is
	# the ID of the base object and the deflated delta data to
	# transform the base object into the target object. The type of
	# the target object is the same as that of the base object and
	# the length value is the byte count of the inflated delta
	# data. The base object may also be DELTA_ENCODED but cyclic
	# base object chains are not permitted and the pack file MUST
	# contain all base objects.
	:	<TYPE_AND_BASE128_SIZE>
	;

<TYPE_AND_BASE128_SIZE>
	# A compact, variable length, encoding of the packed object
	# length and type. The first byte is comprised of 3 fields
	# (where bit 0 is the least significant bit in a byte):
	#	bit 7:		more flag
	#	bits 6-4:	object type
	#	bits 3-0:	least significant bits of the object
	#			length.
	# If the more flag is set, the next byte contains more object
	# length bits.
	# The object types corresponding to the object type bits are:
	#	6 5 4
	#	- - -
	#	0 0 0	invalid: Reserved
	#	0 0 1	COMMIT object
	#	0 1 0	TREE object
	#	0 1 1	BLOB object
	#	1 0 0	TAG object
	#	1 0 1	invalid: Reserved
	#	1 1 0	invalid: Reserved
	#	1 1 1	DELTA_ENCODED object
	#
	# If the more flag was set, the next byte will have more length
	# bits and will be comprised of 2 fields:
	#	bit 7:		more flag
	#	bits 6-0:	7 additional, more significant, bits of
	#			the object length
	# If the more flag is set, the next byte contains more object
	# length bits using the same encoding.
	;
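
# A minimal sketch of decoding a <TYPE_AND_BASE128_SIZE>. The function
# name decode_type_and_size and its out-parameters are hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical helper: decode the object type and length from buf.
 * Returns the number of header bytes consumed, or 0 if the buffer is
 * truncated or the length does not fit in an unsigned long.
 */
size_t decode_type_and_size(const unsigned char *buf, size_t len,
                            unsigned *type, unsigned long *size)
{
    size_t used = 0;
    unsigned shift = 4;   /* the first byte carries 4 length bits */
    unsigned char c;

    if (len == 0)
        return 0;
    c = buf[used++];
    *type = (c >> 4) & 0x07;   /* bits 6-4 */
    *size = c & 0x0f;          /* bits 3-0 */

    while (c & 0x80) {         /* bit 7: more flag */
        if (used >= len || shift >= 8 * sizeof(unsigned long))
            return 0;
        c = buf[used++];
        *size |= (unsigned long)(c & 0x7f) << shift;
        shift += 7;
    }
    return used;
}
```

# For example, the two bytes 0x95 0x0a decode to type 1 (COMMIT) and a
# length of 5 + (10 << 4) = 165.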

<PACKED_OBJECT_DATA>
	:	_deflate_( <DATA> )
	|	<BINARY_OBJ_ID> _deflate_( <DELTA_DATA> )
	;

<DELTA_DATA>
	# Size of the base object encoded as a base 128 number, least
	# significant bits first, using bit 7 (the most significant
	# bit) of each byte to indicate that more bits follow.
	#
	# Size of the result object encoded as a base 128 number, using
	# the same method as used for the base object size.
	#
	# There will then be a sequence of delta hunks.
	# Zero as the value of the first byte of a hunk is reserved.
	#
	# If bit 7 of the first byte of a delta hunk is not set, the
	# hunk is an "insert" hunk and bits 0-6 specify the number of
	# bytes to append to the output buffer from the hunk.
	#
	# If bit 7 of the first byte of a delta hunk is set, the hunk
	# is a "copy" hunk and bits 0-6 specify how the remaining
	# bytes in the hunk make up the base offset and length for the
	# copy. The following C code demonstrates how to determine the
	# base offset and length for the copy:
	#
	#	/* -  -  -  -  -  -  -  -  -  -  -  - *\
	#	 | This reflects version 3 pack files |
	#	\* -  -  -  -  -  -  -  -  -  -  -  - */
	#
	#	unsigned char *data = delta_hunk_start;
	#	unsigned char opcode = *data++;
	#	off_t copy_offset = 0;
	#	size_t copy_length = 0;
	#	int shift, i;
	#
	#	for (shift=i=0; i<4; i++) {
	#		if (opcode & 0x01) {
	#			copy_offset |= (off_t)(*data++)<<shift;
	#			}
	#		opcode >>= 1;
	#		shift += 8;
	#		}
	#
	#	for (shift=i=0; i<3; i++) {
	#		if (opcode & 0x01) {
	#			copy_length |= (*data++)<<shift;
	#			}
	#		opcode >>= 1;
	#		shift += 8;
	#		}
	#
	#	if (!copy_length) {
	#		copy_length = 1<<16;
	#		}
	#
	# For version 2 pack files, the size of a copy is limited to
	# 64K bytes or less and bit 6 of the opcode byte is set if the
	# source of the copy is from the buffer of the result object
	# instead of the base object.
	#
	# It's unknown if any version 2 pack files were created with
	# bit 6 set in the opcode byte; however, the change that added
	# support for version 3 pack files removed the code that would
	# change the copy source to the result buffer.
	#
	#	/* -  -  -  -  -  -  -  -  -  -  -  - *\
	#	 | This reflects version 2 pack files |
	#	\* -  -  -  -  -  -  -  -  -  -  -  - */
	#
	#	unsigned char *data = delta_hunk_start;
	#	unsigned char opcode = *data++;
	#	off_t copy_offset = 0;
	#	size_t copy_length = 0;
	#	int shift, i;
	#
	#	for (shift=i=0; i<4; i++) {
	#		if (opcode & 0x01) {
	#			copy_offset |= (off_t)(*data++)<<shift;
	#			}
	#		opcode >>= 1;
	#		shift += 8;
	#		}
	#
	#	for (shift=i=0; i<2; i++) {
	#		if (opcode & 0x01) {
	#			copy_length |= (*data++)<<shift;
	#			}
	#		opcode >>= 1;
	#		shift += 8;
	#		}
	#
	#	if (!copy_length) {
	#		copy_length = 1<<16;
	#		}
	#
	#	int copy_from_result = opcode & 0x01;
	#
	;
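
# A minimal sketch of decoding one of the two size fields at the start
# of <DELTA_DATA>. The function name decode_delta_size is hypothetical;
# it would be called once for the base object size and once more, at
# the following byte, for the result object size.

```c
#include <stddef.h>

/*
 * Hypothetical helper: decode a base 128 number, least significant
 * bits first, where bit 7 of each byte signals that more bits follow.
 * Returns the number of bytes consumed, or 0 if the buffer is
 * truncated or the value does not fit in an unsigned long.
 */
size_t decode_delta_size(const unsigned char *buf, size_t len,
                         unsigned long *size)
{
    size_t used = 0;
    unsigned shift = 0;
    unsigned char c;

    *size = 0;
    do {
        if (used >= len || shift >= 8 * sizeof(unsigned long))
            return 0;
        c = buf[used++];
        *size |= (unsigned long)(c & 0x7f) << shift;
        shift += 7;
    } while (c & 0x80);
    return used;
}
```

# For example, the two bytes 0xe8 0x07 decode to 0x68 + (7 << 7) = 1000.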


PACK INDEX
----------
# The name of a pack file index is "pack-${PACK_ID}.idx", where
# ${PACK_ID} is the same as that of the pack file that the pack index
# corresponds to.

<PACK_INDEX>
	:	<PACK_INDEX_CONTENTS> <PACK_INDEX_CHECKSUM>
	;

<PACK_INDEX_CONTENTS>
	:	( <INDEX_PARTIAL_COUNT> ){256}
		( <PACK_OBJECT_OFFSET> <BINARY_OBJ_ID> )*
		<PACK_FILE_CHECKSUM>
	;

<INDEX_PARTIAL_COUNT>
	# 32 bit, network byte order, binary integer of the count of
	# objects in the pack file with the first byte of the object
	# ID less than or equal to the index of the count, starting
	# from zero.
	;
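
# A minimal sketch of using the 256 <INDEX_PARTIAL_COUNT> values to
# narrow a lookup. It assumes that the ( <PACK_OBJECT_OFFSET>
# <BINARY_OBJ_ID> ) entries are sorted by object ID, which the grammar
# above does not state. The names read_be32 and fanout_range are
# hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32 bit, network byte order, binary integer. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: counts points at the 256 partial counts (1024
 * bytes). On return, entries *first up to but not including *last are
 * those whose object ID begins with first_byte.
 */
void fanout_range(const unsigned char *counts, unsigned char first_byte,
                  uint32_t *first, uint32_t *last)
{
    *first = first_byte
        ? read_be32(counts + 4 * (size_t)(first_byte - 1))
        : 0;
    *last = read_be32(counts + 4 * (size_t)first_byte);
}
```

# The last of the 256 counts is therefore the total number of objects
# in the pack file.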

<PACK_OBJECT_OFFSET>
	# 32 bit, network byte order, binary integer giving the offset,
	# in bytes from the beginning of the pack file, where the
	# encoding of the object starts.
	;

<PACK_INDEX_CHECKSUM>
	:	_sha-1_digest_( <PACK_INDEX_CONTENTS> )
	;


INDEX FILE (CACHE)
------------------

<INDEX_FILE>
	:	<INDEX_FILE_FORMAT_V1>
	|	<INDEX_FILE_FORMAT_V2>
	;

<INDEX_FILE_FORMAT_V1>
	# This format is no longer supported.
	:	<INDEX_HEADER> <INDEX_CHECKSUM> <INDEX_CONTENTS>
	;

<INDEX_FILE_FORMAT_V2>
	:	<INDEX_HEADER> <EXTENDED_INDEX_CONTENTS> <EXTENDED_CHECKSUM>
	;

<INDEX_HEADER>
	:	"DIRC" <INDEX_FILE_VERSION> <INDEX_ENTRY_COUNT>
	;

<INDEX_FILE_VERSION>
	# 32 bit, network byte order, binary integer indicating which
	# version of the index file format was used to create the
	# index file.
	;

<INDEX_ENTRY_COUNT>
	# 32 bit, network byte order, binary integer containing the
	# number of index entries in the index file.
	;

<EXTENDED_CHECKSUM>
	:	_sha-1_digest_( <EXTENDED_INDEX_CONTENTS> )
	;

<INDEX_CHECKSUM>
	:	_sha-1_digest_( <INDEX_CONTENTS> )
	;

<INDEX_CONTENTS>
	# Index entries are sorted by the byte sequence that comprises
	# the entry name; with a secondary comparison of the stage bits
	# from the <ENTRY_FLAGS> if the entry name byte sequences are
	# identical.
	:	( <INDEX_ENTRY> )*
	;

<EXTENDED_INDEX_CONTENTS>
	:	<INDEX_CONTENTS> <INDEX_CONTENTS_EXTENSIONS>
	;

<INDEX_ENTRY>
	:	<INDEX_ENTRY_STAT_INFO>
		<ENTRY_ID>
		<ENTRY_FLAGS>
		<ENTRY_NAME>
		<NUL>
		<ENTRY_ZERO_PADDING>
	;

<ENTRY_ZERO_PADDING>
	# The minimum length 0x00 byte sequence necessary to make the
	# written or digested byte length of the <INDEX_ENTRY> a
	# multiple of 8.
	;
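
# A minimal sketch of computing the <ENTRY_ZERO_PADDING> length. It
# assumes the field widths given elsewhere in this grammar: 40 bytes
# of <INDEX_ENTRY_STAT_INFO> (two 8 byte times plus six 4 byte
# fields), a 20 byte <ENTRY_ID> and 2 bytes of <ENTRY_FLAGS>. The
# names INDEX_ENTRY_FIXED_LEN and index_entry_padding are hypothetical.

```c
#include <stddef.h>

/* 40 bytes of stat info + 20 byte object ID + 2 bytes of flags. */
#define INDEX_ENTRY_FIXED_LEN 62

/*
 * Hypothetical helper: number of 0x00 padding bytes that follow the
 * <NUL> terminating an entry name of name_len bytes.
 */
size_t index_entry_padding(size_t name_len)
{
    size_t unpadded = INDEX_ENTRY_FIXED_LEN + name_len + 1; /* +1: <NUL> */

    return (8 - unpadded % 8) % 8;
}
```

# A 1 byte name gives a 64 byte entry with no padding; a 2 byte name
# needs 7 padding bytes to reach 72 bytes.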

<INDEX_ENTRY_STAT_INFO>
	# These fields are used as a part of a heuristic to determine
	# if the file system entity associated with this entry has
	# changed. The names are very *nix centric but the exact
	# contents of each field have no meaning to Git, besides exact
	# match, except for the <ENTRY_MODE> and <ENTRY_SIZE> fields.
	:	<ENTRY_CTIME>
		<ENTRY_MTIME>
		<ENTRY_DEV>
		<ENTRY_INODE>
		<ENTRY_MODE>
		<ENTRY_UID>
		<ENTRY_GID>
		<ENTRY_SIZE>
	;

<ENTRY_CTIME>
	# The timestamp of the last status change of the associated
	# file system entity.
	:	<ENTRY_TIME>
	;

<ENTRY_MTIME>
	# The timestamp of the last modification of the associated
	# file system entity.
	:	<ENTRY_TIME>
	;

<ENTRY_TIME>
	:	<TIME_LSB32> <TIME_NSEC>
	;

<TIME_LSB32>
	# 32 bit, network byte order, binary integer containing the lower
	# 32 bits of the entry (file or symbolic link) timestamp.
	;

<TIME_NSEC>
	# 32 bit, network byte order, binary integer containing the lower
	# 32 bits of the entry (file or symbolic link) more precise
	# timestamp, if available.
	;

<ENTRY_DEV>
	# 32 bit, network byte order, binary integer containing the lower
	# 32 bits of the entry (file or symbolic link) file system
	# device identifier. Use of this field is a compile time
	# option.
	;

<ENTRY_INODE>
	# 32 bit, network byte order, binary integer containing the lower
	# 32 bits of the entry (file or symbolic link) inode number, or
	# equivalent.
	;

<ENTRY_MODE>
	# 32 bit, network byte order, binary integer containing the lower
	# 32 bits of the entry (file or symbolic link) file system
	# entity type and permissions.
	;

<ENTRY_UID>
	# 32 bit, network byte order, binary integer containing the lower
	# 32 bits of the entry (file or symbolic link) file system
	# entity owner identifier.
	;

<ENTRY_GID>
	# 32 bit, network byte order, binary integer containing the lower
	# 32 bits of the entry (file or symbolic link) file system
	# entity group identifier, or equivalent.
	;

<ENTRY_SIZE>
	# 32 bit, network byte order, binary integer containing the lower
	# 32 bits of the entry (file or symbolic link) size.
	;

<ENTRY_ID>
	# Object ID of the file system entity contents.
	;

<ENTRY_FLAGS>
	# 16 bit, network byte order, binary integer.
	#	bits 15-14	Reserved
	#	bits 13-12	Entry stage
	#	bits 11-0	Name byte length
	#
	# See git-read-tree(1) for a description of how the stage
	# field is used.
	;
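
# A minimal sketch of unpacking <ENTRY_FLAGS>. The struct entry_flags
# type and the function name decode_entry_flags are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container for the decoded fields. */
struct entry_flags {
    unsigned stage;     /* bits 13-12 */
    unsigned name_len;  /* bits 11-0  */
};

/* Decode the 16 bit, network byte order, flags field. */
struct entry_flags decode_entry_flags(const unsigned char raw[2])
{
    uint16_t v = (uint16_t)((raw[0] << 8) | raw[1]);
    struct entry_flags f;

    f.stage = (v >> 12) & 0x3;
    f.name_len = v & 0x0fff;
    return f;
}
```

# For example, the bytes 0x20 0x0b decode to stage 2 and a name byte
# length of 11.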

<ENTRY_NAME>
	# File system entity name. Path is normalized and relative to
	# the working directory.
	;

<INDEX_CONTENTS_EXTENSIONS>
	:	( <INDEX_EXTENSION> )*
	;

<INDEX_EXTENSION>
	:	<INDEX_EXTENSION_HEADER>
		<INDEX_EXTENSION_DATA>
	;

<INDEX_EXTENSION_HEADER>
	:	<INDEX_EXTENSION_NAME> <INDEX_EXTENSION_DATA_SIZE>
	;

<INDEX_EXTENSION_NAME>
	# 4 byte sequence identifying how the <INDEX_EXTENSION_DATA>
	# should be interpreted. If the first byte has a value greater
	# than or equal to the ASCII character 'A' (0x41) and less than
	# or equal to the ASCII character 'Z' (0x5a), the extension is
	# optional and does not affect the interpretation of the other
	# contents in the index file. Any non-optional extensions must
	# be understood by the reading application to correctly
	# interpret the index file contents.
	;
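
# A minimal sketch of the optional-extension test described for
# <INDEX_EXTENSION_NAME>. The function name
# index_extension_is_optional is hypothetical.

```c
/*
 * Hypothetical helper: name points at the 4 byte extension name.
 * Returns 1 if a reader that does not understand the extension may
 * skip it, 0 if the extension must be understood.
 */
int index_extension_is_optional(const char *name)
{
    return name[0] >= 'A' && name[0] <= 'Z';
}
```

# A reader that meets an unknown extension would skip
# <INDEX_EXTENSION_DATA_SIZE> bytes when this returns 1 and refuse to
# interpret the index file otherwise.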

<INDEX_EXTENSION_DATA_SIZE>
	# 32 bit, network byte order, binary integer containing the
	# length of the <INDEX_EXTENSION_DATA> byte sequence.
	;

<INDEX_EXTENSION_DATA>
	# Sequence of bytes.
	;




* Re: Git files data formats documentation
  2006-08-05 22:35 ` A Large Angry SCM
@ 2006-08-16 16:55   ` Nicolas Pitre
  0 siblings, 0 replies; 13+ messages in thread
From: Nicolas Pitre @ 2006-08-16 16:55 UTC (permalink / raw)
  To: A Large Angry SCM; +Cc: git

A Large Angry SCM wrote:
> This information may be useful for reading and writing the various Git
> files.

[...]

        # For version 2 pack files, the size of a copy is limited to
        # 64K bytes or less and bit 6 of the opcode byte is set if the
        # source of the copy is from the buffer of the result object
        # instead of the base object.
        #
        # It's unknown if any version 2 pack files were created with
        # bit 6 set in the opcode byte; however, the change that added
        # support for version 3 pack files removed the code that would
        # change the copy source to the result buffer.

There were no version 2 pack files with bit 6 set in the opcode byte 
ever produced (except on my own hard disk when I was experimenting with 
that feature).  The (negative) compression gain turned out not to be 
worth the computational cost needed to make use of it, hence that bit is 
now dedicated to specifying an extra size byte.

See commit d60fc1c8649f80c006b9f493c542461e81608d4b log message for 
more.


Nicolas
