From: Thomas Gummerer <t.gummerer@gmail.com>
To: git@vger.kernel.org
Cc: trast@student.ethz.ch, mhagger@alum.mit.edu, gitster@pobox.com,
pcouds@gmail.com, robin.rosenberg@dewire.com,
Thomas Gummerer <t.gummerer@gmail.com>
Subject: [PATCH/RFC v2 07/16] Add documentation of the index-v5 file format
Date: Sun, 5 Aug 2012 23:49:04 +0200
Message-ID: <1344203353-2819-8-git-send-email-t.gummerer@gmail.com>
In-Reply-To: <1344203353-2819-1-git-send-email-t.gummerer@gmail.com>
Add documentation of the index file format version 5 to
Documentation/technical.
Helped-by: Michael Haggerty <mhagger@alum.mit.edu>
Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Thomas Rast <trast@student.ethz.ch>
Helped-by: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
Helped-by: Robin Rosenberg <robin.rosenberg@dewire.com>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
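Not part of the patch: a minimal sketch, in C, of how a reader might parse
the 20-byte header described in the document below (sig, vnr, ndir, nfile,
fblockoffset, all in network byte order). The names struct v5_header,
get_be32 and parse_v5_header are hypothetical and local to this sketch; they
are not taken from this series.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of the 20-byte v5 header. */
struct v5_header {
    uint32_t vnr;          /* version number */
    uint32_t ndir;         /* number of directories in the index */
    uint32_t nfile;        /* number of file entries in the index */
    uint32_t fblockoffset; /* offset to the file block */
};

/* Read a 32-bit integer stored in network (big-endian) byte order. */
static uint32_t get_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Parse the fixed header. Returns 0 on success, -1 if the buffer is
 * shorter than 20 bytes, the signature is not "DIRC", or the version
 * is not 5 (this sketch only handles version 5).
 */
static int parse_v5_header(const unsigned char *buf, size_t len,
                           struct v5_header *out)
{
    if (len < 20 || memcmp(buf, "DIRC", 4) != 0)
        return -1;
    out->vnr = get_be32(buf + 4);
    if (out->vnr != 5)
        return -1;
    out->ndir = get_be32(buf + 8);
    out->nfile = get_be32(buf + 12);
    out->fblockoffset = get_be32(buf + 16);
    return 0;
}
```

The sketch deliberately stops before nextensions, the extension offsets and
headercrc, since their position depends on the number of extensions.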
Documentation/technical/index-file-format-v5.txt | 281 ++++++++++++++++++++++
1 file changed, 281 insertions(+)
create mode 100644 Documentation/technical/index-file-format-v5.txt
diff --git a/Documentation/technical/index-file-format-v5.txt b/Documentation/technical/index-file-format-v5.txt
new file mode 100644
index 0000000..6253e34
--- /dev/null
+++ b/Documentation/technical/index-file-format-v5.txt
@@ -0,0 +1,281 @@
+GIT index format
+================
+
+== The git index file format
+
+ The git index file (.git/index) documents the status of the files
+ in the git staging area.
+
+ The staging area is used for preparing commits, merging, etc.
+
+ All binary numbers are in network byte order. Version 5 is described
+ here.
+
+ - A 20-byte header consisting of
+
+ sig (32-bits): Signature:
+ The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")
+
+ vnr (32-bits): Version number:
+ The currently supported versions are 2, 3, 4 and 5.
+
+ ndir (32-bits): number of directories in the index.
+
+ nfile (32-bits): number of file entries in the index.
+
+ fblockoffset (32-bits): offset to the file block, relative to the
+ beginning of the file.
+
+ - Offset to the extensions.
+
+ nextensions (32-bits): number of extensions.
+
+ extoffset (32-bits): offset to an extension. (There are nextensions
+ of these offsets, possibly none.)
+
+ headercrc (32-bits): crc checksum for the header and extension
+ offsets
+
+ - diroffsets (ndir * directory offsets): A directory offset for each
+ of the ndir directories in the index, sorted by pathname (of the
+ directory it's pointing to) (see below). The diroffsets are relative
+ to the beginning of the direntries block. [1]
+
+ - direntries (ndir * directory entries): A directory entry for each
+ of the ndir directories in the index, sorted by pathname (see
+ below). [2]
+
+ - fileoffsets (nfile * file offsets): A file offset for each of the
+ nfile files in the index (see below). The file offsets are relative
+ to the beginning of the fileentries block. [1]
+
+ - fileentries (nfile * file entries): A file entry for each of the
+ nfile files in the index (see below).
+
+ - crdata: A number of entries for conflicted data/resolved conflicts
+ (see below).
+
+ - Extensions (currently none; see below)
+
+ Extensions are identified by signature. Optional extensions can
+ be ignored if GIT does not understand them.
+
+ GIT supports an arbitrary number of extensions, but currently none
+ are implemented. [3]
+
+ extsig (32-bits): extension signature. If the first byte is 'A'..'Z'
+ the extension is optional and can be ignored.
+
+ extsize (32-bits): size of the extension, excluding the header
+ (extsig, extsize, extchecksum).
+
+ extchecksum (32-bits): crc32 checksum of the extension signature
+ and size.
+
+ - Extension data.
+
+
+== Directory offsets (diroffsets)
+
+ diroffset (32-bits): offset to the directory relative to the beginning
+ of the index file. There are ndir + 1 offsets in the diroffset table,
+ the last of which points to the end of the last direntry. With this
+ last entry, we can avoid calling strlen when reading each pathname by
+ calculating its length from the offsets.
+
+ This part is needed for making the directory entries bisectable and
+ thus allowing a binary search.
+
+== Directory entry (direntries)
+
+ Directory entries are sorted in lexicographic order by the name
+ of their path starting with the root.
+
+ pathname (variable length, nul terminated): relative to top level
+ directory (without the leading slash). '/' is used as path
+ separator. A string of length 0 ('') indicates the root directory.
+ The special path components "." and ".." (without quotes) are
+ disallowed. The path also includes a trailing slash. [9]
+
+ foffset (32-bits): offset to the lexicographically first file in
+ the file offsets (fileoffsets), relative to the beginning of
+ the fileoffset block.
+
+ cr (32-bits): offset to conflicted/resolved data at the end of the
+ index. 0 if there is no such data. [4]
+
+ ncr (32-bits): number of conflicted/resolved data entries at the
+ end of the index if the offset is nonzero. If cr is 0, ncr is
+ also 0.
+
+ nsubtrees (32-bits): number of subtrees this tree has in the index.
+
+ nfiles (32-bits): number of files in the directory, that are in
+ the index.
+
+ nentries (32-bits): number of entries in the index that are covered
+ by the tree this entry represents. (-1 if the entry is invalid).
+ This number includes all the files in this tree, recursively.
+
+ objname (160-bits): object name for the object that would result
+ from writing this span of index as a tree. This is only valid
+ if nentries is valid, meaning the cache-tree is valid.
+
+ flags (16-bits): 'flags' field split into (high to low bits) (For
+ D/F conflicts)
+
+ stage (2-bits): stage of the directory during merge
+
+ 14-bit unused
+
+ dircrc (32-bits): crc32 checksum for each directory entry.
+
+ The last 24 bytes (4-byte number of entries + 160-bit object name) are
+ for the cache tree. An entry can be in an invalidated state which is
+ represented by having -1 in the nentries field.
+
+ The entries are written out in the top-down, depth-first order. The
+ first entry represents the root level of the repository, followed by
+ the first subtree - let's call it A - of the root level, followed by
+ the first subtree of A, ... There is no prefix compression for
+ directories.
+
+== File offsets (fileoffsets)
+
+ fileoffset (32-bits): offset to the file.
+
+ This part is needed for making the file entries bisectable and
+ thus allowing a binary search. There are nfile + 1 offsets in the
+ fileoffset table, the last of which points to the end of the last
+ fileentry. With this last entry, we can avoid calling strlen when
+ reading each filename by calculating its length from the offsets.
+
+== File entry (fileentries)
+
+ File entries are sorted in ascending order on the name field, after the
+ respective offset given by the directory entries. All file names are
+ prefix compressed, meaning the file name is relative to the directory.
+
+ filename (variable length, nul terminated). The exact encoding is
+ undefined, but the filename cannot contain a NUL byte (iow, the same
+ encoding as a UNIX pathname).
+
+ flags (16-bits): 'flags' field split into (high to low bits)
+
+ assumevalid (1-bit): assume-valid flag
+
+ intenttoadd (1-bit): intent-to-add flag, used by "git add -N".
+ Extended flag in index v3.
+
+ stage (2-bit): stage of the file during merge
+
+ skipworktree (1-bit): skip-worktree flag, used by sparse checkout.
+ Extended flag in index v3.
+
+ 11-bit unused, must be zero [6]
+
+ mode (16-bits): file mode, split into (high to low bits)
+
+ objtype (4-bits): object type
+ valid values in binary are 1000 (regular file), 1010 (symbolic
+ link) and 1110 (gitlink)
+
+ 3-bit unused
+
+ permission (9-bits): unix permission. Only 0755 and 0644 are valid
+ for regular files. Symbolic links and gitlinks have value 0 in
+ this field.
+
+ mtimes (32-bits): mtime seconds, the last time a file's data changed;
+ this is stat(2) data
+
+ mtimens (32-bits): mtime nanosecond fractions
+ this is stat(2) data
+
+ statcrc (32-bits): crc32 checksum over ctime seconds, ctime
+ nanoseconds, ino, file size, dev, uid, gid (All stat(2) data
+ except mtime) [7]
+
+ objhash (160-bits): SHA-1 for the represented object
+
+# This will probably be changed in future versions as discussed here: http://colabti.org/irclogger/irclogger_log/git-devel?date=2012-06-21
+ entrycrc (32-bits): crc32 checksum for the file entry. The crc code
+ includes the offset to the file.
+
+== Conflict data
+
+ A conflict is represented in the index as a set of higher stage entries.
+ These entries are stored at the end of the index. When a conflict is
+ resolved (e.g. with "git add path"), a bit is flipped to indicate that
+ the conflict is resolved, but the entries will be kept, so that
+ conflicts can be recreated (e.g. with "git checkout -m") in case users
+ want to redo a conflict resolution from scratch.
+
+ The first part of a conflict (usually stage 1) will be stored both in
+ the entries part of the index and in the conflict part. All other parts
+ will only be stored in the conflict part.
+
+ filename (variable length, nul terminated): filename of the entry,
+ relative to its containing directory.
+
+ nfileconflicts (32-bits): number of conflicts for the file [8]
+
+ flags (nfileconflicts * flags) (16-bits): 'flags' field split into:
+
+ conflicted (1-bit): conflicted state (conflicted/resolved) (1 if
+ conflicted)
+
+ stage (2-bits): stage during merge.
+
+ 13-bit unused
+
+ entry_mode (nfileconflicts * entry mode) (16-bits): octal numbers, entry
+ mode of each entry in the different stages. (How many is defined by
+ the 4-byte number before)
+
+ objectnames (nfileconflicts * object names) (160-bits): object names
+ of the different stages.
+
+ conflictcrc (32-bits): crc32 checksum over conflict data.
+
+== Design explanations
+
+[1] The directory and file offsets are included in the index format
+ to enable bisectability of the index, for binary searches. Updating
+ a single entry and partial reading will benefit from this.
+
+[2] The directories are saved in their own block, to be able to
+ quickly search for a directory in the index. They include an
+ offset to the (lexically) first file in the directory.
+
+[3] The data of the cache-tree extension and the resolve undo
+ extension is now part of the index itself, but if other extensions
+ come up in the future, there is no need to change the index, they
+ can simply be added at the end.
+
+[4] To avoid rewrites of the whole index when there are conflicts or
+ conflicts are being resolved, conflicted data will be stored at
+ the end of the index. To mark the conflict resolved, just a bit
+ has to be flipped. The data will still be there, if a user wants
+ to redo the conflict resolution.
+
+[5] Since only 4 modes are effectively allowed in git but 32 bits are
+ used to store them, having a two-bit flag for the mode is enough
+ and saves 4 bytes per entry.
+
+[6] The length of the file name was dropped, since each file name is
+ nul terminated anyway.
+
+[7] Since all stat data (except mtime and ctime) is just used for
+ checking if a file has changed, a checksum of the data is enough.
+ In addition to that, Thomas Rast suggested ctime could be ditched
+ completely (core.trustctime=false) and thus included in the
+ checksum. This would save 24 bytes per index entry, which would
+ be about 4 MB on the Webkit index.
+ (Thanks to Michael Haggerty for the suggestion)
+
+[8] Since there can be more than one stage #1 entry, it is necessary
+ to know how many conflict data entries there are.
+
+[9] As Michael Haggerty pointed out on the mailing list, storing the
+ trailing slash will simplify a few operations.
--
1.7.10.GIT
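
As a follow-up illustration (not part of the patch), the bit layout of the
file-entry 'flags' and 'mode' fields described in the document above could
be decoded roughly as follows. This is a sketch under the assumption that
"high to low bits" means the first listed field occupies the most
significant bits; struct v5_file_flags, decode_file_flags, mode_objtype and
mode_permission are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical decoded form of the 16-bit file-entry flags. */
struct v5_file_flags {
    unsigned assumevalid;  /* assume-valid flag */
    unsigned intenttoadd;  /* intent-to-add flag ("git add -N") */
    unsigned stage;        /* stage of the file during merge, 0..3 */
    unsigned skipworktree; /* skip-worktree flag (sparse checkout) */
};

/*
 * Decode flags, high to low: assumevalid (1 bit), intenttoadd (1 bit),
 * stage (2 bits), skipworktree (1 bit), then 11 unused bits.
 * Returns -1 if any unused bit is set ("must be zero"), 0 otherwise.
 */
static int decode_file_flags(uint16_t flags, struct v5_file_flags *out)
{
    if (flags & 0x07ff)
        return -1;
    out->assumevalid = (flags >> 15) & 0x1;
    out->intenttoadd = (flags >> 14) & 0x1;
    out->stage = (flags >> 12) & 0x3;
    out->skipworktree = (flags >> 11) & 0x1;
    return 0;
}

/* Object type: the top 4 bits of the 16-bit mode field. */
static unsigned mode_objtype(uint16_t mode)
{
    return (mode >> 12) & 0xf;
}

/* Unix permission: the low 9 bits of the 16-bit mode field. */
static unsigned mode_permission(uint16_t mode)
{
    return mode & 0x1ff;
}
```

With this layout, a regular file with permission 0644 has mode 0x81a4
(object type 1000 in binary, permission bits 0644 in octal).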