Re: Understanding version 4 packs


From: "Peter Eriksen" <s022018@student.dtu.dk>
To: Nicolas Pitre <nico@cam.org>
Cc: git@vger.kernel.org
Subject: Re: Understanding version 4 packs
Date: Sun, 25 Mar 2007 10:35:30 +0200
Message-ID: <20070325083530.GA25523@bohr.gbar.dtu.dk>
In-Reply-To: <alpine.LFD.0.83.0703241913110.18328@xanadu.home>

On Sat, Mar 24, 2007 at 07:24:17PM -0400, Nicolas Pitre wrote:
> On Sat, 24 Mar 2007, Peter Eriksen wrote:
> 
> > There is a new tree type called OBJ_DICT_TREE, which looks something
> > like the following:
> > 
> > +-----------------+------------------------------------------------+----
> > |  Table offset   |  SHA-1 of the blob corresponding to the path.  | ...
> > +-----------------+------------------------------------------------+----
> >       6 bytes                     20 bytes
> 
> Actually it is a 2-byte index in the path table, and a 4-byte index in a 
> common SHA1 table.  So each tree entry is 6 bytes total.

What happens to paths that have no corresponding entry in the
path name table because they are not among the 65535 most frequent
paths in the pack?

> > The index (.idx) files are extended to have a 4 byte pointer to the
> > offset of this file name table in the pack file for easy lookup.
> 
> Right.  And it will lose the SHA1 entries since they are already 
> available in the pack.

Does this mean that the current index format will change from:

  - The header is followed by sorted 24-byte entries, one entry
    per object in the pack.  Each entry is:

    4-byte network byte order integer, recording where the
    object is stored in the packfile as the offset from the
    beginning.

to just 4-byte entries, and that the SHA-1 entries are then found in
that extra table of SHA-1s referenced by OBJ_DICT_TREE objects in the
pack file?
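
For reference, here is a minimal sketch of the current 24-byte index
entry I am asking about: the 4-byte network-byte-order offset quoted
above, followed by the 20-byte object name. The function names are
made up for illustration, and this only covers a single entry, not the
whole .idx file.

```python
import struct

# One entry of the current index: 4-byte big-endian offset into the
# pack file, then the 20-byte SHA-1 object name. 4 + 20 = 24 bytes.
IDX_ENTRY = struct.Struct(">I20s")


def pack_idx_entry(offset, sha1):
    """Serialize one (offset, sha1) pair (hypothetical helper)."""
    if len(sha1) != 20:
        raise ValueError("object name must be 20 bytes")
    return IDX_ENTRY.pack(offset, sha1)


def unpack_idx_entry(data):
    """Return (offset, sha1) from a 24-byte entry (hypothetical helper)."""
    return IDX_ENTRY.unpack(data)
```

If the object names move out of the index, only the 4-byte offset part
of each entry would remain.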

Regards,

Peter

P.S. I have updated my description of the pack format. Any comments are
welcome.

On-disk format of version 4 packs (v0.1)
========================================

There is a file name table, EXT_OBJ_FILENAME_TABLE, which may be placed
anywhere in the pack file as long as it comes before any OBJ_DICT_TREE
object that references it, so that the pack can be easily streamed. It
uses the format:

+-------------------------------+
|  Compressed file name table   |
+-------------------------------+

The uncompressed file name table contains NR_ENTRIES entries,
and looks like this:

+------------+------+--------------+------+--------------------+----
| NR_ENTRIES | MODE |  Full path 1 | MODE | Full path 2        | ...
+------------+------+--------------+------+--------------------+----
   4 bytes    2 bytes   n1 bytes    2 bytes     n2 bytes     

MODE is a network-byte-order integer representing the mode of the path,
and the path is a variable length, null-terminated string.
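
To check my own reading of this layout, here is a minimal sketch of
writing and parsing the uncompressed table. It assumes that NR_ENTRIES
is also stored in network byte order, which the description above does
not say. The function names are made up, and the compressed form is
assumed to be a plain zlib stream, so it would be passed through
zlib.decompress() before parsing.

```python
import struct


def build_filename_table(entries):
    """Serialize a list of (mode, path) pairs into the uncompressed layout.

    Assumption: NR_ENTRIES is big-endian like MODE.
    """
    out = [struct.pack(">I", len(entries))]
    for mode, path in entries:
        out.append(struct.pack(">H", mode))  # 2-byte mode
        out.append(path + b"\0")             # null-terminated path
    return b"".join(out)


def parse_filename_table(data):
    """Parse the uncompressed table back into a list of (mode, path)."""
    (nr_entries,) = struct.unpack_from(">I", data, 0)
    pos = 4
    entries = []
    for _ in range(nr_entries):
        (mode,) = struct.unpack_from(">H", data, pos)
        pos += 2
        end = data.index(b"\0", pos)  # find the path terminator
        entries.append((mode, data[pos:end]))
        pos = end + 1
    return entries
```

Since the paths are variable length, entry i can only be found by
walking the table from the start, so a reader would probably build an
array of entry positions once after decompressing.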

The table is sorted by path then mode for easy binary lookup, and so
that pointers into this table can be compared directly instead of
comparing the corresponding paths and modes. This table contains the
65535 most used paths in the entire pack.
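
A small sketch of why the sort order helps, assuming plain bytewise
comparison of the path followed by numeric comparison of the mode (the
exact ordering rules are my assumption, and the function names are
hypothetical): a binary search finds the index of a (path, mode) pair,
and comparing two indices gives the same answer as comparing the
entries themselves.

```python
import bisect


def make_table(entries):
    """Build the sorted in-memory table from (mode, path) pairs.

    Entries are stored as (path, mode) so that tuple comparison sorts
    by path first and mode second.
    """
    return sorted({(path, mode) for mode, path in entries})


def lookup(table, path, mode):
    """Return the table index of (path, mode), or None if it is absent."""
    key = (path, mode)
    i = bisect.bisect_left(table, key)
    if i < len(table) and table[i] == key:
        return i
    return None
```

With this, two tree entries can be ordered by comparing their 2-byte
table indices without looking at the strings at all.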

There is a new tree type called OBJ_DICT_TREE, which looks like the
following:

+--------+----------------+----
| P offs |   SHA-1 offs   | ...
+--------+----------------+----
  2 bytes      4 bytes

That is, each entry contains a 2-byte index into the path table, and a
corresponding 4-byte index into a SHA-1 table.
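
As a sketch of this entry layout, assuming both indices are stored in
network byte order (my assumption, the thread has not confirmed it) and
using made-up function names:

```python
import struct

# One OBJ_DICT_TREE entry: 2-byte path table index followed by a
# 4-byte SHA-1 table index. ">" disables padding, so the size is 6.
DICT_TREE_ENTRY = struct.Struct(">HI")


def encode_dict_tree(entries):
    """Serialize a list of (path_index, sha1_index) pairs."""
    return b"".join(DICT_TREE_ENTRY.pack(p, s) for p, s in entries)


def decode_dict_tree(data):
    """Parse the payload back into a list of (path_index, sha1_index)."""
    if len(data) % DICT_TREE_ENTRY.size != 0:
        raise ValueError("payload is not a whole number of 6-byte entries")
    return [
        DICT_TREE_ENTRY.unpack_from(data, off)
        for off in range(0, len(data), DICT_TREE_ENTRY.size)
    ]
```

Because every entry has the same size, entry i starts at byte 6 * i,
which is not possible with the variable-length entries of an ordinary
tree object.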

These new tree objects remain uncompressed in the pack file, but are
sorted with, and deltified against, other tree objects. All normal tree
objects are converted to OBJ_DICT_TREE when packing, and are converted
back on the fly for callers that need an ordinary OBJ_TREE.

The index (.idx) files are extended to have a 4 byte pointer to the
offset of this file name table in the pack file for easy lookup.

There is a similar table, EXT_OBJ_IDENT_TABLE, of common strings in
commit objects (e.g. author and timezone), and a new object type,
OBJ_DICT_COMMIT, but I have not quite understood those yet.
