Re: Index format v5 - Michael Haggerty

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Michael Haggerty <mhagger@alum.mit.edu>
To: Thomas Gummerer <t.gummerer@gmail.com>
Cc: git@vger.kernel.org, trast@student.ethz.ch, gitster@pobox.com,
	peff@peff.net, spearce@spearce.org, davidbarr@google.com
Subject: Re: Index format v5
Date: Mon, 07 May 2012 17:15:15 +0200	[thread overview]
Message-ID: <4FA7E703.7040408@alum.mit.edu> (raw)
In-Reply-To: <CALgYhfMKdbv8TiT4ALDSvD3pSXHEPLWHM09DxYnRmRdBWRjh8Q@mail.gmail.com>

On 05/03/2012 07:25 PM, Thomas Gummerer wrote:
> I have been drafting the Version 5 of the index format over the past
> few days with the help of Thomas Rast, Michael Haggerty, cmn and
> barrbrain on IRC. It will save with prefix compression on the path, and
> using a crc32 over the stat data, instead of the full data, since it is only
> used for checking if the file is changed. (Thanks Michael Haggerty for
> this hint. Unless we are missing something this will save another
> ~4 MB on the Webkit index.
>
>
>
> GIT index format
> ================
 > [...]

Here are some comments about the format document, version 55047b3d. 
They probably overlap with other feedback that you have gotten, but I 
don't have time to cross-correlate everything.  Hope it helps.

Overall
=======

I find the format definition pretty difficult to read.  The following
suggestions might make it easier to follow.

* You might want to give each of the elements C-like names; e.g.,
   "ndir (32 bits): number of directories in the index".  These names
   could be used as unambiguous references from other parts of the
   documentation.  The names could also be used in the implementing
   code, to make the correspondence with the format definition easier
   to follow.

* Name things consistently.  For example, "A number of sorted
   directories (see below)" should be changed to "A number of directory
   entries (see below), sorted by path name".

* Putting the above two things together, the description "A number of
   sorted directories" might become "direntries (ndir * directory
   entries): one directory entry for each of the ndir directories in
   the index, sorted by path name".

* Are all "offsets" relative to the start of the index file?  If so,
   it would be good to state this explicitly at the start (and maybe
   rename them "file positions"?).  If not, please be very explicit
   about where each offset is measured from, and whether it is signed
   or unsigned.

* You seem to switch randomly between counting sizes in bits vs bytes.
   Please try to be more consistent.  BTW, I think the size of an SHA1
   object name is more commonly described as "20 bytes" rather than
   "160 bits".

* The details of the extension data blocks are described in the first
   (overview) section, whereas it seems like they should be described
   in their own section following the "conflict data" section.  But
   wouldn't the presence of extension data blocks prevent the addition
   of conflict data?

* Are there situations (for example during conflicts?) when it is
   allowed to have directory and file entries with the same name?

* Does the index file include directory entries for empty directories?
   What about directories that contain only other directories?

Overview
========

* Does "32-bit size of the extension" include the whole extension data
   block (including header) or only the "extension data"?

Directory entry
===============

* "4-byte number of entries in the index that is covered by the tree
   this entry represents."  What does this include?
   Files/directories/both?  Recursive or non-recursive?

* "160-bit object name for the object that would result from writing
   this span of index as a tree."  Is this always valid?

* It might be convenient to store directory names with trailing '/'
   characters (except for the top-level directory, which can be stored
   as "").  That way (1) it is trivial to concatenate the directory
   name and the filename to produce the file path; (2) directories and
   files can be distinguished by name and therefore intermingled in
   lists; (3) such lists can be sorted using strcmp() without having to
   simulate an invisible '/' character.

File entry
==========

* I believe that only the basename is stored, not the whole path.  But
   then why is the encoding for '/' specified (there should be no '/'
   characters)?

* Why is the encoding of '.' specified?  Is it somehow significant for
   the index file format?

* Are file entries sorted by entire path or only by the basename?

Flat loading
============

* I found the explanation pretty incomprehensible.  Perhaps some
   pseudo-code would make it clearer?

* Since I can't understand the explanation, I'm not sure if this
   comment makes any sense.  But when traversing into a subdirectory,
   don't *all* of the remaining files from the parent directory need to
   be tucked away somewhere?

* At an even higher level of abstraction, your directory entry
   contains a count "number of entries in the index that is covered by
   the tree this entry represents".  If this count is recursive, then
   it seems like it would be enough to know how many entries will come
   from traversing a whole subdirectory tree.  So it should be possible
   to skip that many entries in the in-memory array and continue
   reading the file entries for the parent subdirectory.  For example,
   suppose our files are [A, B/1, B/2/a, B/2/b, B/3, C].  If I start by
   reading the files in the root directory, then I can fill the index
   array entries

       [A, -, -, -, -, C]

   because when I see that "B" is a directory containing a total of
   four entries, I just leave fours spaces for them in the array and
   continue with the next file, "C".  Of course I would have to
   remember (e.g., on a queue) that directory "B" still needs to be
   processed, and the starting index in the array where its entries
   have to be filled in.  Still, this jumping around would be in the
   RAM holding the index array pointers rather than in the file
   positions.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

next prev parent reply	other threads:[~2012-05-07 15:22 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-05-03 17:25 Index format v5 Thomas Gummerer
2012-05-03 18:16 ` Thomas Rast
2012-05-03 19:03   ` Junio C Hamano
2012-05-04  7:12   ` Michael Haggerty
2012-05-07 22:18     ` Robin Rosenberg
2012-05-03 18:21 ` Ronan Keryell
2012-05-03 20:36   ` Thomas Gummerer
2012-05-03 18:54 ` Junio C Hamano
2012-05-03 19:11   ` Thomas Rast
2012-05-03 19:31   ` Thomas Rast
2012-05-03 19:32     ` Thomas Rast
2012-05-03 20:32       ` Junio C Hamano
2012-05-03 21:38   ` Thomas Gummerer
2012-05-07 18:57     ` Robin Rosenberg
2012-05-03 19:38 ` solo-git
2012-05-04 13:20 ` Nguyen Thai Ngoc Duy
2012-05-04 15:44   ` Thomas Gummerer
2012-05-04 13:25 ` Philip Oakley
2012-05-04 15:46   ` Junio C Hamano
2012-05-06 10:23 ` Nguyen Thai Ngoc Duy
2012-05-07 13:44   ` Thomas Gummerer
2012-05-06 16:49 ` Phil Hord
2012-05-07 13:08   ` Thomas Gummerer
2012-05-07 15:15 ` Michael Haggerty [this message]
2012-05-08 14:11   ` Thomas Gummerer
2012-05-08 14:25     ` Nguyen Thai Ngoc Duy
2012-05-08 14:34       ` Nguyen Thai Ngoc Duy
2012-05-10  6:53         ` Thomas Gummerer
2012-05-10 11:06           ` Nguyen Thai Ngoc Duy
2012-05-09  8:37     ` Michael Haggerty
2012-05-10 12:19       ` Thomas Gummerer
2012-05-10 18:17         ` Michael Haggerty
2012-05-11 17:12           ` Thomas Gummerer
2012-05-13 19:50             ` Michael Haggerty
2012-05-14 15:01               ` Thomas Gummerer
2012-05-14 21:08                 ` Michael Haggerty
2012-05-14 22:10                   ` Thomas Rast
2012-05-15  6:43                     ` Michael Haggerty
2012-05-15 13:49                   ` Thomas Gummerer
2012-05-15 15:02                     ` Michael Haggerty
2012-05-18 15:38                       ` Thomas Gummerer
2012-05-19 13:00                         ` Michael Haggerty
2012-05-21  7:45                           ` Thomas Gummerer
2012-05-16  5:01                     ` Michael Haggerty
2012-05-16 21:54                       ` Thomas Gummerer
2012-05-19  5:40                         ` Michael Haggerty
2012-05-21 20:30                           ` Thomas Gummerer
2012-05-13 21:01 ` Philip Oakley
2012-05-14 14:54   ` Thomas Gummerer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4FA7E703.7040408@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=davidbarr@google.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=spearce@spearce.org \
    --cc=t.gummerer@gmail.com \
    --cc=trast@student.ethz.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).