From: Thomas Gummerer <t.gummerer@gmail.com>
To: git@vger.kernel.org
Cc: trast@student.ethz.ch, gitster@pobox.com, mhagger@alum.mit.edu,
peff@peff.net, spearce@spearce.org, davidbarr@google.com
Subject: Index format v5
Date: Thu, 3 May 2012 19:25:12 +0200 [thread overview]
Message-ID: <CALgYhfMKdbv8TiT4ALDSvD3pSXHEPLWHM09DxYnRmRdBWRjh8Q@mail.gmail.com> (raw)
I have been drafting the Version 5 of the index format over the past
few days with the help of Thomas Rast, Michael Haggerty, cmn and
barrbrain on IRC. It will save with prefix compression on the path, and
using a crc32 over the stat data, instead of the full data, since it is only
used for checking if the file is changed. (Thanks Michael Haggerty for
this hint. Unless we are missing something this will save another
~4 MB on the Webkit index.
GIT index format
================
= The git index file has the following format
All binary numbers are in network byte order. Version 5 is described
here.
- A 16-byte header consisting of
4-byte signature:
The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")
4-byte version number:
The current supported versions are 2, 3, 4 and 5.
32-bit number of directories.
32-bit number of file entries.
- Offset to the extensions.
32-bit number of extensions.
32-bit number offset to the extension. (Possibly none, as many as
indicated in the 4-byte number of extensions)
- A number of directory offsets (see below). [1]
- A number of sorted directories (see below). [2]
- 32-bit crc32 checksum for the header, extension offsets and directories.
- A number of file offsets (see below). [1]
- A number of file entries (see below).
- A number of entries for conflicted data/resolved conflicts (see below).
- Extensions
Extensions are identified by signature. Optional extensions can
be ignored if GIT does not understand them.
GIT supports an arbitrary number of extension, but currently none
is implemented. [3]
4-byte extension signature. If the first byte is 'A'..'Z' the
extension is optional and can be ignored.
32-bit size of the extension
32-bit crc32 checksum of the extension signature and size
Extension data
== Directory entry offsets
32-bit offset to the directory.
This part is needed for making the directory entries bisectable and
thus allowing a binary search.
== Directory entry
Directory entries are sorted in lexicographic order by the name
of their path starting with the root.
Path names (variable length) relative to top level directory (without the
leading slash). '/' is used as path separator. '.' indicates the root
directory. The special patch components ".." and ".git" (without quotes)
are disallowed. Trailing slash is also disallowed.
1 nul byte to terminate the path.
32-bit offset to the first file of a directory
32-bit offset to conflicted/resolved data at the end of the index.
0 if there is no such data. [4]
4-byte number of subtrees this tree has
4-byte number of entries in the index that is covered by the tree this
entry represents. (entry_count) (-1 if the entry is invalid)
160-bit object name for the object that would result from writing
this span of index as a tree.
The last 24 bytes are for the cache tree. An entry can be in an
invalidated state which is represented by having -1 in the entry_count
field. If an entry is in invalidated state, the next entry will begin
after the number of subtrees, and the 160-bit object name is dropped.
The entries are written out in the top-down, depth-first order. The
first entry represents the root level of the repository, followed by
the first subtree - let's call it A - of the root level, followed by
the first subtree of A, ...
== File entry offsets
32-bit offset to the directory.
This part is needed for making the directory entries bisectable and
thus allowing a binary search.
== File entry
File entries are sorted in ascending order on the name field, after the
respective offset given by the directory entries.
File name (variable length). Nul bytes are not allowed in file names and
they have no leading slash. They are 7-bit ASCII encoded.
1 nul byte to terminate the filename.
A 16-bit 'flags' field split into (high to low bits)
1-bit assume-valid flag
1-bit conflict flag
2-bit stage (during merge)
2-bit mode (0 = 1000644 (regular file without execution
permission), 1 = 1000755 (regular file with execution
permission), 2 = 1010000 (symbolic link), 3 = 1110
(gitlink)) [5]
1-bit skip-worktree flag (used by sparse checkout)
1-bit intent-to-add flag (used by "git add -N")
8-bit unused, must be zero [6]
32-bit mtime seconds, the last time a file's data changed
this is stat(2) data
32-bit mtime nanosecond fractions
this is stat(2) data
32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]
160-bit SHA-1 for the represented object
32-bit crc32 checksum for the file entry
== Conflicted data
A conflict is represented in the index as a set of higher stage entries.
These entries are stored at the end of the index. When a conflict is
resolved (e.g. with "git add path"). A bit is flipped, to indicate that
the conflict is resolved, but the entries will be kept, so that
conflicts can be recreated (e.g. with "git checkout -m", in case users
want to redo a conflict resolution from scratch.
- NUL-terminated filename of the entry
- A 8-bit 'flags' field split into:
- 1-bit conflicted state (conflicted/resolved) (1 if conflicted)
- 7-bit unused
- Three 4-byte octal numbers, entry mode of entries in stage 1 to 3 (a
missing stage is represented by "0" in this field);
and
- At most three 160-bit object names of the entry in stages from 1 to 3
(nothing is written for a missing stage).
- 32-bit crc32 checksum over one conflicted entry.
== Design explanations
[1] The directory and file offsets are included in the index format
to enable bisectability of the index, for binary searches.Updating
a single entry and partial reading will benefit from this.
[2] The directories are saved in their own block, to be able to
quickly search for a directory in the index. They include a
offset to the (lexically) first file in the directory.
[3] The data of the cache-tree extension and the resolve undo
extension is now part of the index itself, but if other extensions
come up in the future, there is no need to change the index, they
can simply be added at the end.
[4] To avoid rewrites of the whole index when there are conflicts or
conflicts are being resolved, conflicted data will be stored at
the end of the index. To mark the conflict resolved, just a bit
has to be flipped. The data will still be there, if a user wants
to redo the conflict resolution.
[5] Since only 4 modes are effectively allowed in git but 32-bit are
used to store them, having a two bit flag for the mode is enough
and saves 4 byte per entry.
[6] The length of the file name was dropped, since each file name is
nul terminated anyway.
[7] Since all stat data (except mtime and ctime) is just used for
checking if a file has changed a checksum of the data is enough.
In addition to that Thomas Rast suggested ctime could be ditched
completely (core.trustctime=false) and thus included in the
checksum. This would save 24 bytes per index entry, which would
be about 4 MB on the Webkit index.
(Thanks for the suggestion to Michael Haggerty)
next reply other threads:[~2012-05-03 17:25 UTC|newest]
Thread overview: 49+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-05-03 17:25 Thomas Gummerer [this message]
2012-05-03 18:16 ` Index format v5 Thomas Rast
2012-05-03 19:03 ` Junio C Hamano
2012-05-04 7:12 ` Michael Haggerty
2012-05-07 22:18 ` Robin Rosenberg
2012-05-03 18:21 ` Ronan Keryell
2012-05-03 20:36 ` Thomas Gummerer
2012-05-03 18:54 ` Junio C Hamano
2012-05-03 19:11 ` Thomas Rast
2012-05-03 19:31 ` Thomas Rast
2012-05-03 19:32 ` Thomas Rast
2012-05-03 20:32 ` Junio C Hamano
2012-05-03 21:38 ` Thomas Gummerer
2012-05-07 18:57 ` Robin Rosenberg
2012-05-03 19:38 ` solo-git
2012-05-04 13:20 ` Nguyen Thai Ngoc Duy
2012-05-04 15:44 ` Thomas Gummerer
2012-05-04 13:25 ` Philip Oakley
2012-05-04 15:46 ` Junio C Hamano
2012-05-06 10:23 ` Nguyen Thai Ngoc Duy
2012-05-07 13:44 ` Thomas Gummerer
2012-05-06 16:49 ` Phil Hord
2012-05-07 13:08 ` Thomas Gummerer
2012-05-07 15:15 ` Michael Haggerty
2012-05-08 14:11 ` Thomas Gummerer
2012-05-08 14:25 ` Nguyen Thai Ngoc Duy
2012-05-08 14:34 ` Nguyen Thai Ngoc Duy
2012-05-10 6:53 ` Thomas Gummerer
2012-05-10 11:06 ` Nguyen Thai Ngoc Duy
2012-05-09 8:37 ` Michael Haggerty
2012-05-10 12:19 ` Thomas Gummerer
2012-05-10 18:17 ` Michael Haggerty
2012-05-11 17:12 ` Thomas Gummerer
2012-05-13 19:50 ` Michael Haggerty
2012-05-14 15:01 ` Thomas Gummerer
2012-05-14 21:08 ` Michael Haggerty
2012-05-14 22:10 ` Thomas Rast
2012-05-15 6:43 ` Michael Haggerty
2012-05-15 13:49 ` Thomas Gummerer
2012-05-15 15:02 ` Michael Haggerty
2012-05-18 15:38 ` Thomas Gummerer
2012-05-19 13:00 ` Michael Haggerty
2012-05-21 7:45 ` Thomas Gummerer
2012-05-16 5:01 ` Michael Haggerty
2012-05-16 21:54 ` Thomas Gummerer
2012-05-19 5:40 ` Michael Haggerty
2012-05-21 20:30 ` Thomas Gummerer
2012-05-13 21:01 ` Philip Oakley
2012-05-14 14:54 ` Thomas Gummerer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CALgYhfMKdbv8TiT4ALDSvD3pSXHEPLWHM09DxYnRmRdBWRjh8Q@mail.gmail.com \
--to=t.gummerer@gmail.com \
--cc=davidbarr@google.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=mhagger@alum.mit.edu \
--cc=peff@peff.net \
--cc=spearce@spearce.org \
--cc=trast@student.ethz.ch \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).