From: Shawn Pearce <spearce@spearce.org>
To: Jonas Fonseca <jonas.fonseca@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: Computing delta sizes in pack files
Date: Sat, 25 Nov 2006 02:33:38 -0500
Message-ID: <20061125073338.GF4528@spearce.org>
In-Reply-To: <2c6b72b30611220844t4b341284q4bff914b91eac48d@mail.gmail.com>
Jonas Fonseca <jonas.fonseca@gmail.com> wrote:
> I will not post the numbers here. They are available in
> http://jonas.nitro.dk/tmp/stats.pdf for those interested. The following
> is my "analysis" of the numbers.
Thanks, this was interesting stuff.
> As expected, the randomness of the content of both commit and tag objects
> results in a very poor packing performance of only 2%.
This is one reason why Jon Smirl was pushing the idea of dictionary-based
compression. git.git has only 276 unique author lines, and just 37 of them
belong to the real top committers. Not surprisingly Junio C Hamano leads
the pack with 3529+ commits... :-)
Dictionary-based compression would allow us to easily compress
Junio's authorship line away from those 3529+ commits into a single
string, getting much better compression on the commits.
In trees this may work very well too for very common file names, e.g.
"Makefile". Yes, each tree delta compresses very well against its
base (and likely copies the file name from the base even when the
SHA1 changed), but if the bases were able to use a common dictionary,
that would help even more.
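To give a feel for the author-line case, here is a rough sketch using
zlib's preset dictionary support. This is only my stand-in for
"dictionary based compression"; it is not how Git stores objects and
not necessarily what Jon had in mind. The author name, the dictionary
contents and the helper names are all made up for illustration.

```python
import zlib

# Hypothetical shared dictionary: strings expected to recur in many commits.
# (Made-up author; a real dictionary would be built from the repository.)
DICTIONARY = (
    b"author A U Thor <author@example.com> "
    b"\ncommitter A U Thor <author@example.com> "
)

def deflate(data, zdict=None):
    """Compress data with zlib, optionally primed with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj()
    else:
        comp = zlib.compressobj(zdict=zdict)
    return comp.compress(data) + comp.flush()

def inflate(data, zdict=None):
    """Decompress data; the same dictionary must be supplied if one was used."""
    if zdict is None:
        decomp = zlib.decompressobj()
    else:
        decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(data) + decomp.flush()

# A commit-like payload (contents are illustrative, not a real commit).
commit = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A U Thor <author@example.com> 1164439999 -0500\n"
    b"committer A U Thor <author@example.com> 1164439999 -0500\n"
    b"\n"
    b"Example commit message\n"
)

plain = deflate(commit)
primed = deflate(commit, DICTIONARY)
print(len(commit), len(plain), len(primed))
```

The primed stream only has to reference the author string in the
dictionary instead of spelling it out, which is the effect described
above. The catch is that every reader needs the exact same dictionary
to inflate the object again.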
> The data show that for minimal index files, the packs need to contain
> more than 2500 objects. The 24 bytes per-object for the optimal case
> includes 20-bytes for the object SHA1, and thus cannot be expected to
> become lower.
This is just a fundamental property of the pack index file format.
The file *MUST* have 1064 bytes of fixed overhead, plus 24 bytes of
data per object indexed. So the fixed overhead amortizes very
quickly over the individual object entries, at which point it's
effectively 24 bytes per entry. This all of course assumes a 32-bit
index (which is the current format).
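To make the amortization concrete, here is a minimal sketch of the
arithmetic. It uses only the two constants above; in the 32-bit format
the 1064 bytes are a 256-entry fan-out table of 4-byte counts plus two
20-byte SHA1 trailers, and each entry is a 4-byte offset plus the
20-byte object SHA1. The function names are mine, purely illustrative.

```python
FIXED_OVERHEAD = 1064   # 256 * 4 fan-out bytes + 2 * 20 trailer SHA1 bytes
BYTES_PER_ENTRY = 24    # 4-byte pack offset + 20-byte object SHA1

def index_size(num_objects):
    """Total size in bytes of a 32-bit pack index for num_objects objects."""
    return FIXED_OVERHEAD + BYTES_PER_ENTRY * num_objects

def bytes_per_object(num_objects):
    """Average index bytes per object, fixed overhead included."""
    return index_size(num_objects) / num_objects

for n in (10, 100, 2500, 100000):
    print(f"{n:>7} objects: {index_size(n):>8} bytes, "
          f"{bytes_per_object(n):.2f} bytes/object")
```

At 2500 objects the average is already about 24.4 bytes per object,
which matches the threshold Jonas observed.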
The thing is the Mozilla index is 44 MiB. That's roughly 1.9 million
objects. The index itself is larger than the entire git.git pack.
On a large repository the index ain't trivial... yet it's essential
to performance!
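The object count follows from running the same arithmetic backwards.
A quick sketch, assuming the index is exactly 44 MiB (the real size is
only known to be roughly that) and using a made-up helper name:

```python
FIXED_OVERHEAD = 1064   # bytes of fixed overhead in a 32-bit pack index
BYTES_PER_ENTRY = 24    # bytes per indexed object

def objects_in_index(index_bytes):
    """Estimate how many objects an index of index_bytes bytes describes."""
    return (index_bytes - FIXED_OVERHEAD) // BYTES_PER_ENTRY

# Assumption: "44 MiB" taken as an exact figure for illustration.
mozilla_index_bytes = 44 * 1024 * 1024
print(objects_in_index(mozilla_index_bytes))  # 1922345, i.e. roughly 1.9 million
```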
On the other hand the 1064 bytes of fixed overhead in the index
is nothing compared to the overhead in say an RCS file. Or an
SVN repository... :-)
What I failed to point out in my script (or in my email) is that
the 24 bytes of index entry cannot be eliminated, and thus must
be added to the "revision cost". In some cases it's about the same
size as the deltified revision in the pack file. :-(