Re: Index format v5 - Thomas Gummerer

From: Thomas Gummerer <t.gummerer@gmail.com>
To: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
Cc: git@vger.kernel.org, trast@student.ethz.ch, gitster@pobox.com,
	mhagger@alum.mit.edu, peff@peff.net, spearce@spearce.org,
	davidbarr@google.com
Subject: Re: Index format v5
Date: Fri, 4 May 2012 17:44:24 +0200
Message-ID: <20120504154424.GA923@tgummerer.unibz.it>
In-Reply-To: <CACsJy8B9p1Z_eW20mZwBLwRnFWHstEdRxmw7GujECpMKByfBEg@mail.gmail.com>

On 05/04, Nguyen Thai Ngoc Duy wrote:
> On Fri, May 4, 2012 at 12:25 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> > GIT index format
> > ================
> >
> > = The git index file has the following format
> >
> >  All binary numbers are in network byte order. Version 5 is described
> >  here.
> >   ...
> >   - A number of directory offsets (see below). [1]
> >
> >   - A number of sorted directories (see below). [2]
> >
> >   - 32-bit crc32 checksum for the header, extension offsets and directories.
> 
> So we use one checksum for all dirs? I thought we could do checksum
> per dir, so if I'm interested in path/to/here only, I only need to
> verify data of three directories.

Good point. I'm not sure yet exactly how per-directory checksums would
be implemented, but probably as one checksum covering each directory's
offset plus its directory data. I'll definitely think about this.
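
To make the per-directory idea concrete, here is a minimal sketch. It
assumes a record made of a 32-bit offset followed by the directory data,
with a CRC32 appended. The record layout and the function names are
hypothetical, chosen only for illustration; they are not part of the
draft format.

```python
import struct
import zlib

def pack_dir_record(offset: int, dir_data: bytes) -> bytes:
    """Append a CRC32 over (32-bit offset + directory data) to one record."""
    body = struct.pack(">I", offset) + dir_data   # ">I": network byte order
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return body + struct.pack(">I", crc)

def verify_dir_record(record: bytes) -> bool:
    """Check a single directory's record without reading any other one."""
    body = record[:-4]
    (stored,) = struct.unpack(">I", record[-4:])
    return (zlib.crc32(body) & 0xFFFFFFFF) == stored
```

With a layout like this, a reader interested only in path/to/here would
verify the three records on that path and skip the rest.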

> > == Directory entry offsets
> >
> >  32-bit offset to the directory.
> >
> >  This part is needed for making the directory entries bisectable and
> >    thus allowing a binary search.
> 
> How is this (I assume) array ordered? The same top-down depth-first
> with "Directory entry" section below? I can see ordering as
> top-down/breadth-first help bsearch though.

True, the breadth-first approach might be better, since we are using
prefix compression for the pathname. It will need some more offsets
(or some more calculation), but should still be faster.
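
For reference, the two orderings under discussion can be sketched as
follows. This only illustrates the orderings themselves, not the
on-disk layout; the helper names are hypothetical, and "." stands for
the root directory as in the draft.

```python
def components(path: str) -> tuple:
    # "." denotes the root directory, which has no path components.
    return () if path == "." else tuple(path.split("/"))

def depth_first(paths):
    """Top-down, depth-first: a directory is followed by its subtrees."""
    return sorted(paths, key=components)

def breadth_first(paths):
    """Top-down, breadth-first: all directories of one level come first."""
    return sorted(paths, key=lambda p: (len(components(p)), components(p)))

dirs = ["b/z", "a", ".", "a/y", "b", "a/x"]
print(depth_first(dirs))    # ['.', 'a', 'a/x', 'a/y', 'b', 'b/z']
print(breadth_first(dirs))  # ['.', 'a', 'b', 'a/x', 'a/y', 'b/z']
```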

> > == Directory entry
> >
> >  Directory entries are sorted in lexicographic order by the name
> >  of their path starting with the root.
> >
> >  Path names (variable length) relative to top level directory (without the
> >    leading slash). '/' is used as path separator. '.' indicates the root
> >    directory. The special path components ".." and ".git" (without quotes)
> >    are disallowed. Trailing slash is also disallowed.
> >
> >  1 nul byte to terminate the path.
> 
> I don't see it mention prefix compression here, nor in "file entry"
> section. Does it use it here? If so I don't think prefix compression
> plays well with bsearch (on path name). In the worst case you may have
> to process up to the first entry in order to get a path name (e.g. a
> directory with entries "a", "aa", "aaa", "aaaa"...)

I planned to use prefix compression here, which would especially benefit
the reader (we're reading more often than writing). By designing the
offsets carefully we should still be able to get O(log n) search time
for a directory, where n is the number of directories in the index.
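
One possible way to combine the two, shown purely as a sketch and not as
something this thread has settled on: store a full path at every k-th
entry (a "restart point"), let the offsets refer only to those entries,
and prefix-compress the entries in between. A lookup then binary-searches
the restart points and decodes at most k entries. The RESTART interval
and all names below are assumptions, and the entries live in Python lists
rather than in an on-disk layout.

```python
import os

RESTART = 4  # hypothetical interval: every 4th entry stores its full path

def compress(sorted_paths):
    """Return (entries, restarts); each entry is (shared_len, suffix)."""
    entries, restarts, prev = [], [], ""
    for i, path in enumerate(sorted_paths):
        if i % RESTART == 0:
            shared = 0                 # restart point: keep the full path
            restarts.append(i)
        else:
            shared = len(os.path.commonprefix([prev, path]))
        entries.append((shared, path[shared:]))
        prev = path
    return entries, restarts

def find(entries, restarts, target):
    """Binary-search the restart points, then scan at most RESTART entries."""
    lo, hi = 0, len(restarts)
    while lo < hi:                     # find the last restart key <= target
        mid = (lo + hi) // 2
        if entries[restarts[mid]][1] <= target:
            lo = mid + 1
        else:
            hi = mid
    if lo == 0:
        return -1                      # target sorts before every entry
    start = restarts[lo - 1]
    path = ""
    for i in range(start, min(start + RESTART, len(entries))):
        shared, suffix = entries[i]
        path = path[:shared] + suffix  # rebuild the name from the previous one
        if path == target:
            return i
    return -1
```

For the worst case mentioned above ("a", "aa", "aaa", "aaaa", ...), a
lookup decodes at most RESTART entries instead of walking back to the
first one, at the cost of storing some paths uncompressed.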

> >  The entries are written out in the top-down, depth-first order. The
> >    first entry represents the root level of the repository, followed by
> >    the first subtree - let's call it A - of the root level, followed by
> >    the first subtree of A, ...
> 
> So depth-first traversal becomes natural even without the help of
> directory offset table above. Nice.
> 
> > == File entry
> >
> >  File entries are sorted in ascending order on the name field, after the
> >  respective offset given by the directory entries.
> 
> I wonder if we need to keep file entry table separate from directory
> entry. It feels more natural to put the sequence of file entries of a
> directory right after the directory entry, might help read-ahead too
> during traversal. You save 4 bytes (for file entry offset) in each
> directory entry. You still have file offset table for random access.

The reason for this design choice is fast directory lookup (for partial
reading or for changing a single file in the index). Keeping them
separate also simplifies reading the cache-tree, which will be included
in the directory section. If the file entries followed each directory
entry, then instead of offsets to the first file we'd need offsets to
the next directory to enable fast reading of the cache-tree.

> >  File name (variable length). Nul bytes are not allowed in file names and
> >    they have no leading slash. They are 7-bit ASCII encoded.
> 
> Why can't it be 8-bit? I suppose file name is also prefix compressed?

I changed that: the file name can be UTF-8 or ASCII encoded, as was
allowed in the old index.

--
Thomas
