* pack v4 status
@ 2007-02-27 15:50 Shawn O. Pearce
From: Shawn O. Pearce @ 2007-02-27 15:50 UTC (permalink / raw)
  To: git; +Cc: Nicolas Pitre

Nico's and my packv4 topic is available from my fastimport.git fork
on repo.or.cz:

  gitweb: http://repo.or.cz/w/git/fastimport.git
  git:    git://repo.or.cz/git/fastimport.git
  branch: sp/pack4

We have thus far reformatted OBJ_TREEs with a new dictionary-based
compression scheme.  In this scheme we pool the filenames and modes
that appear within trees into a single table within the packfile.
All trees are then converted to use a 22-byte record format:

  - 2-byte network byte order index into the string pool
  - 20-byte SHA-1
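
As a rough illustration in C (the struct and field names here are
made up for illustration, not the actual definitions from the
series):

	struct pack4_tree_entry {
		unsigned char index[2]; /* big-endian index into the
					   filename/mode pool */
		unsigned char sha1[20]; /* SHA-1 of the referenced
					   object */
	};

	/* reading the pool index from a raw 22-byte record: */
	unsigned int pool_idx = (buf[0] << 8) | buf[1];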

These trees are then stored *uncompressed* within the packfile,
but are also still stored using our standard delta system (the
deltas for these trees are likewise stored uncompressed).  The
resulting savings are pretty good; on linux-2.6.git we are saving
~3.8 MiB as a result of this encoding alone:

	141649022 pack2-linuxA.git
	137625761 pack4-linuxB.git

read_sha1_file() has been modified to unpack this new tree format
back into the canonical format; something that I think is largely
unnecessary at runtime, given how easy it is to iterate the encoded
tree, but it is still critically important for tools like
git-cat-file, git-index-pack and git-verify-pack.  (A rough sketch
of the reconversion appears after the timings below.)  Future plans
are to iterate the encoded tree directly, but performance is already
faster despite needing to reconvert the tree:

  lh=825020c3866e7312947e17a0caa9dd1a5622bafc
  git --git-dir=pack2-linux.git rev-list $lh -- include/asm-m68k
        3.97 real         3.60 user         0.15 sys
        3.98 real         3.60 user         0.15 sys
        3.98 real         3.60 user         0.15 sys
        3.98 real         3.60 user         0.15 sys
        3.98 real         3.60 user         0.15 sys

  git --git-dir=pack4-linux.git rev-list $lh -- include/asm-m68k
        3.52 real         3.17 user         0.13 sys
        3.46 real         3.17 user         0.13 sys
        3.51 real         3.17 user         0.13 sys
        3.52 real         3.18 user         0.13 sys
        3.53 real         3.16 user         0.13 sys

I'll take a 500 millisecond savings any day, thanks!  :-)
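
Here is the reconversion sketch mentioned above; pool_lookup() is a
stand-in for however the series actually maps a pool index to its
"<mode> <name>" string, and the rest is likewise illustration, not
the real code:

#include <string.h>

/* stand-in: resolve a pool index to its "<mode> <name>" string */
extern const char *pool_lookup(unsigned int idx);

/* rebuild the canonical "<mode> <name>\0<sha1>" entry stream */
static size_t decode_tree(const unsigned char *src, size_t len,
			  unsigned char *dst)
{
	unsigned char *out = dst;
	while (len >= 22) {
		const char *prefix = pool_lookup((src[0] << 8) | src[1]);
		size_t n = strlen(prefix);
		memcpy(out, prefix, n);			/* "<mode> <name>" */
		out[n] = '\0';				/* NUL separator */
		memcpy(out + n + 1, src + 2, 20);	/* raw SHA-1 */
		out += n + 21;
		src += 22;
		len -= 22;
	}
	return out - dst;	/* length of the canonical tree */
}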


Nico and I have only started working on commits, so the above results
still utilize the packv2 format for OBJ_COMMIT and do not take into
account any of our proposed concepts there.


The impetus for packv4 is to format the packfile in such a way that
we can work with the data faster at runtime for common operations,
like rev-list and its builtin path limiter.  We also want to make
reachability analysis (critical for packing and fsck) faster.
Any reduction in storage size is considered a bonus here, though
obviously there is some correlation between size of input data and
the time required to process it.  ;-)

The patch series for this is getting large.  Right now we are up to
32 patches in the series.  Given where we are and where we want to
go I'm predicting this series will come out at close to 100 patches.
Of course that's partly because I'm working in fairly small units,
slowly iterating the code into the final version we want.

I am constantly rebasing the sp/pack4 topic noted above, so the
patch count is not really because I'm going back and fixing things
in later patches.  It's because I'm trying to slowly iterate the
runtime side of things in digestible changes, then the packing side,
so that the system still works at every single commit in the series.
Yes, it's a *BIG* set of code changes.

Obviously this series has a heavy hand in sha1_file.c,
builtin-pack-objects.c, builtin-unpack-objects.c and index-pack.c.
But it will also start to hit less obvious places like commit.c
and tree-walk.c as we start to support walking the encoded objects
directly.

Given the huge size of the series, the amount of effort we are
tossing into it, and the fact that I'm trying to make it pu-ready by
early next week, we would appreciate it if folks could keep changes
to the above-mentioned files limited to critical bug fixes only.  :)

-- 
Shawn.

* Re: pack v4 status
@ 2007-02-28 10:04 linux
From: linux @ 2007-02-28 10:04 UTC (permalink / raw)
  To: spearce; +Cc: git, linux

Just a minor note:

If you want to have an arbitrary-sized string table without special cases,
the standard binary Huffman tree construction algorithm can be extended
quite simply to a radix-256 tree, so every code is an integer number
of bytes long.

Take the input strings and sort them in order of frequency of
appearance.  Pad them at the end with dummy strings (frequency = 0)
until the number of strings equals 1 mod 255.  (For example, 1000
strings would be padded with 21 dummies, since 1021 = 4*255 + 1.)

Then take the last (least frequent) 256 strings off the list, and
replace them with a single prefix whose frequency is the sum of all
of the components.  Each of the selected strings will be encoded as
the prefix (whose code has not yet been decided) plus one trailing
byte (256 possible values).  Insert the new prefix into the list in
sorted order.

Repeat until the list has only one entry, which is represented by a
zero-length prefix.  (Since each step replaces 256 entries with 1
entry, the number of entries never changes mod 255, so the algorithm
will always converge to one entry.)
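
As a rough sketch in C (ignoring efficiency and error handling; the
function and array names are my illustration, not existing code),
the whole construction looks like this:

#include <stdlib.h>

#define RADIX 256

/*
 * Assumptions: freq[] is sorted by decreasing frequency and has
 * already been padded with zero-frequency dummies so n % 255 == 1.
 * On return, len[i] is the code length, in bytes, of string i.
 */
static void build_lengths(const unsigned long *freq, int *len, int n)
{
	int max = 2 * n;	/* leaves plus internal prefix nodes */
	unsigned long *f = malloc(max * sizeof(*f));
	int *parent = malloc(max * sizeof(*parent));
	int *order = malloc(max * sizeof(*order));
	int i, nodes = n, live = n;

	for (i = 0; i < n; i++) {
		f[i] = freq[i];
		parent[i] = -1;
		order[i] = i;	/* input is already sorted */
	}
	while (live > 1) {
		/* take the 256 least frequent entries off the list */
		unsigned long sum = 0;
		int j, pos;
		for (j = 0; j < RADIX; j++) {
			int id = order[--live];
			parent[id] = nodes;
			sum += f[id];
		}
		/* ...and replace them with a single prefix node */
		f[nodes] = sum;
		parent[nodes] = -1;
		/* insert the new prefix into the list in sorted order */
		pos = live;
		while (pos > 0 && f[order[pos - 1]] < sum) {
			order[pos] = order[pos - 1];
			pos--;
		}
		order[pos] = nodes++;
		live++;
	}
	/* a string's code length is its depth below the root */
	for (i = 0; i < n; i++) {
		int d = 0, id = i;
		while (parent[id] != -1) {
			id = parent[id];
			d++;
		}
		len[i] = d;
	}
	free(f);
	free(parent);
	free(order);
}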


This is generally encoded more compactly as a "canonical Huffman tree".
For each string, forget the actual encoding decided on here, and just
remember the number of bytes assigned to encode it; this is the only
number you need to record to reconstruct the tree for decoding.
Then count the number of strings to be encoded with 1, 2, 3... bytes.
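
Concretely, that counting pass is tiny (MAX_LEN is an assumed upper
bound on code length in bytes; len[] holds the per-string lengths
computed above):

	unsigned int count[MAX_LEN + 1] = { 0 };
	int i;

	for (i = 0; i < n; i++)
		count[len[i]]++;	/* count[l]: strings coded in l bytes */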

To decode, read one byte.  If the byte is less than the number of 1-byte
strings, its value is the string's position in the 1-byte string table.

If the byte is >= the number of 1-byte strings, subtract the number
of 1-byte strings, multiply by 256, and add a following byte.  If
the result is less than the number of 2-byte strings, it is the
position in the 2-byte string table.

If it is >= the number of 2-byte strings, subtract that count,
multiply by 256, and add another trailing byte.  If the result is
less than the number of 3-byte strings, it is the position in the
3-byte string table.  Repeat as necessary.
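
Putting that together, the decoder comes out to just a few lines of
C (count[] is the table just built; table[l][i], a stand-in name, is
the i-th string among those with l-byte codes; handling of corrupt
input is omitted):

static const char *decode_string(const unsigned char **pp,
				 const unsigned int *count,
				 const char ***table)
{
	const unsigned char *p = *pp;
	unsigned long v = *p++;
	int l = 1;

	while (v >= count[l]) {
		/* too big for the l-byte table: subtract its size,
		   then extend the code by one trailing byte */
		v = (v - count[l]) * 256 + *p++;
		l++;
	}
	*pp = p;		/* bytes consumed: l */
	return table[l][v];	/* position v in the l-byte string table */
}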


There are elaborations on this to bound the maximum number of bytes
required; see the zlib source for details.

If you have a significant number of unique (frequency == 1) strings,
you can encode them in-line after a common "unique string" prefix
(which can have a frequency greater than 1).


I'm not sure if you can follow all that (it helps to already
understand Huffman trees), but I can explain at greater length if
necessary.
The important thing is that it's quite straightforward and provides
a general solution.

