From: Shawn Pearce <spearce@spearce.org>
To: Jon Smirl <jonsmirl@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: Mozilla .git tree
Date: Tue, 29 Aug 2006 21:51:22 -0400
Message-ID: <20060830015122.GE22935@spearce.org>
In-Reply-To: <9e4733910608291807q9b896e4sdbfaa9e49de58c2b@mail.gmail.com>
Jon Smirl <jonsmirl@gmail.com> wrote:
> I suspect the bulk of the file will be the base blobs. A zlib
> dictionary would help more with the trees and the 120K copies of the
> GPL in the files.
Here's what I got by taking the output of verify-pack -v run
against the 430 MiB Mozilla pack and running that through a simple
Perl script:
COUNT BASE commit: 197613
COUNT BASE tree: 154496
COUNT BASE blob: 49860
COUNT BASE tag: 1203
COUNT DELTA commit: 3308
COUNT DELTA tree: 976712
COUNT DELTA blob: 579780
COUNT DELTA tag: 353
Those are just raw numbers of objects of each type broken out by
base and delta. We gotta alotta objects. :-)
We probably also have around 49,860 copies of the identical license
text (one per base object). I'm just assuming the xdelta algorithm
would recognize the identical run in the dependent object and
copy it from the base rather than use a literal insert command.
Thus I'm assuming the 579,780 deltas don't contain the license text.
UNCOMP BASE commit: 55 MiB
UNCOMP BASE tree: 30 MiB
UNCOMP BASE blob: 597 MiB
UNCOMP BASE tag: 0 MiB
UNCOMP DELTA commit: 0 MiB
UNCOMP DELTA tree: 44 MiB
UNCOMP DELTA blob: 190 MiB
UNCOMP DELTA tag: 0 MiB
These are the sizes of the objects and deltas prior to using zlib
to deflate them (aka the decompression buffer size, stored in the
object header).
ZIPPED BASE commit: 38 MiB
ZIPPED BASE tree: 26 MiB
ZIPPED BASE blob: 164 MiB
ZIPPED BASE tag: 0 MiB
ZIPPED DELTA commit: 0 MiB
ZIPPED DELTA tree: 73 MiB
ZIPPED DELTA blob: 126 MiB
ZIPPED DELTA tag: 0 MiB
These are the sizes of the objects within the pack, determined by
computing the difference in adjacent objects' offsets.
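The tallying described above can be sketched roughly as follows.
This is not the original Perl script; it is a minimal Python sketch
that assumes the verify-pack -v output has already been parsed into
records. The PackEntry type, its field names, and the pack_end
parameter (the offset where the last object ends, i.e. just before
the pack trailer) are hypothetical names introduced for illustration.
Note that a size derived from offsets covers the whole pack entry,
so it includes the per-object header as well as the deflated data.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class PackEntry(NamedTuple):
    """One parsed object record (hypothetical shape, not a Git API)."""
    kind: str       # "commit", "tree", "blob", or "tag"
    is_delta: bool  # True if the object is stored as a delta
    size: int       # inflated size recorded in the object header
    offset: int     # byte offset of the entry within the pack


def summarize(entries: Iterable[PackEntry],
              pack_end: int) -> Dict[Tuple[str, str], Dict[str, int]]:
    """Tally count, inflated bytes, and in-pack bytes per (BASE|DELTA, type).

    The in-pack size of an entry is taken as the distance from its
    offset to the next entry's offset (or to pack_end for the last one).
    """
    ordered = sorted(entries, key=lambda e: e.offset)
    stats = defaultdict(lambda: {"count": 0, "uncomp": 0, "zipped": 0})
    for i, entry in enumerate(ordered):
        if i + 1 < len(ordered):
            next_offset = ordered[i + 1].offset
        else:
            next_offset = pack_end
        bucket = stats[("DELTA" if entry.is_delta else "BASE", entry.kind)]
        bucket["count"] += 1
        bucket["uncomp"] += entry.size
        bucket["zipped"] += next_offset - entry.offset
    return dict(stats)
```

Sorting by offset first matters because the listing is not
necessarily in pack order, and the subtraction only makes sense
between physically adjacent entries.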
55 MiB of commits compressed into 38 MiB (saved 30%).
We can probably do better.
30 MiB of tree bases compressed into 26 MiB (saved 13.3%).
With 154,496 tree bases I think we can do better _somehow_. It may
just mean using more deltas so we have fewer bases. We don't have
154k unique directories. It may just mean using a tree-specific
pack dictionary is enough.
44 MiB of tree deltas compressed into 73 MiB (saved -65.9%).
Ouch! We wasted 29 MiB by trying to compress tree deltas.
Way to go zlib!
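One plausible contributor, offered as an assumption rather than a
diagnosis of the Mozilla pack: a zlib stream carries a fixed header
and checksum, and deflate cannot shrink small, high-entropy input,
so such input comes out larger than it went in. The sketch below
uses two SHA-1 digests as a hypothetical stand-in for a small delta
payload; it does not reproduce Git's actual delta encoding.

```python
import hashlib
import zlib


def deflated_size(payload: bytes) -> int:
    """Return the length of the zlib stream produced for payload."""
    return len(zlib.compress(payload))


# Hypothetical stand-in for a tiny, incompressible delta: 40 bytes
# made of two SHA-1 digests (hash output is effectively random).
payload = hashlib.sha1(b"old entry").digest() + hashlib.sha1(b"new entry").digest()

print("raw bytes:     ", len(payload))
print("deflated bytes:", deflated_size(payload))
```

Running this shows the deflated size exceeding the raw size. The
per-object overhead is small, but it is paid once per object, and
there are 976,712 tree deltas here.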
Blob bases were 597 MiB uncompressed, 164 MiB compressed (saved 72%).
Blob deltas were 190 MiB uncompressed, 126 MiB compressed (saved 33%).
We might be able to do better here, but we're already faring pretty
well.
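For reference, the "saved" percentages above follow from
(uncompressed - compressed) / uncompressed. A minimal sketch,
assuming the rounded MiB figures quoted in this message as input
(the function name and labels are illustrative only); because the
inputs are rounded, the results can differ slightly from figures
computed on exact byte counts.

```python
def saved_percent(uncompressed: float, compressed: float) -> float:
    """Percentage of space saved; negative when compression expands data."""
    return (uncompressed - compressed) / uncompressed * 100.0


# (uncompressed MiB, compressed MiB), as quoted in this message.
figures = {
    "commit bases": (55, 38),
    "tree bases": (30, 26),
    "tree deltas": (44, 73),
    "blob bases": (597, 164),
    "blob deltas": (190, 126),
}

for label, (uncomp, zipped) in figures.items():
    print(f"{label}: saved {saved_percent(uncomp, zipped):.1f}%")
```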
For comparison, a .tar.gz of the ,v files from CVS is around 550 MiB.
We're already smaller than that in a pack file. But ,v is not the
most compact representation. I hoped we could do even better than
430 MiB.
I ran the same script against my Git pack. There I'm seeing the
same explosion of tree deltas: uncompressed they are 1380174 bytes,
compressed they are 1620439 bytes (-17.4% saved).
We may well have a general problem here with always compressing
tree deltas. It appears to be a minor dent in the space required
for a pack but it's certainly a non-trivial amount on the larger
Mozilla pack. The wasted space is 2% of the Git pack and 6.7%
of the Mozilla pack.
--
Shawn.