From: Thomas Rast <trast@inf.ethz.ch>
To: Ivan Tolstosheyev <ivan.tolstosheyev@gmail.com>
Cc: <git@vger.kernel.org>
Subject: Re: Git tree object storing policy
Date: Tue, 21 Feb 2012 11:18:27 +0100
Message-ID: <87vcn0bibw.fsf@thomas.inf.ethz.ch>
In-Reply-To: <loom.20120221T094746-680@post.gmane.org> (Ivan Tolstosheyev's message of "Tue, 21 Feb 2012 09:22:12 +0000 (UTC)")
Ivan Tolstosheyev <ivan.tolstosheyev@gmail.com> writes:
> #!/usr/bin/env bash
>
> git init test
> cd test
> for i in `seq 1 10000`
> do
> touch ${i} ; git add ${i} ; git commit -m "Add ${i}" ;
> done
> cd ..
> du -hs test
[...]
> 180 MB!!!?? and 7.4M after `git gc` - thanks to delta compression!
Most of those 180MB are waste: each loose object occupies whole
filesystem blocks (presumably 4KB each) that it leaves mostly unused.
You should be looking at the post-gc'd numbers.
Let's see the breakdown of 'du -h .git':
    0       .git/rr-cache
    1.5M    .git/logs/refs/heads
    1.5M    .git/logs/refs
    2.9M    .git/logs
    4.0K    .git/objects/info
    2.8M    .git/objects/pack
    2.8M    .git/objects
    0       .git/branches
    12K     .git/info
    0       .git/remotes
    88K     .git/hooks
    0       .git/refs/tags
    0       .git/refs/heads
    0       .git/refs
    6.5M    .git
So 2.9MB is git keeping a reflog of everything we did (once for HEAD and
once for master). Since merely storing a SHA1 for each of your 10000
operations already takes 200K, that's not so far off -- the remaining
factor of about 10 is in the name, email, date and log message.
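To make the reflog estimate concrete, here is a rough sketch that builds
one sample reflog line and multiplies its length out. The identity,
timestamp and message are made up for illustration, and the function
name reflog_bytes is hypothetical; real lines vary a little in length.

```shell
#!/usr/bin/env bash
# Rough reflog size estimate. The identity, timestamp and message below
# are illustrative assumptions, not values taken from the test repository.
reflog_bytes() {
    local entries=$1 reflogs=$2 sha line
    printf -v sha '%040d' 0          # stand-in for a 40-digit hex SHA1
    # One line: old SHA1, new SHA1, identity, timestamp, timezone,
    # then a tab, the message and a trailing newline.
    printf -v line '%s %s %s %s %s\t%s\n' \
        "$sha" "$sha" 'A U Thor <author@example.com>' \
        1329819507 +0100 'commit: Add 10000'
    echo $(( ${#line} * entries * reflogs ))
}

# 10000 commits, each logged in two reflogs (HEAD and master)
reflog_bytes 10000 2
```

With these sample values the result lands close to the 2.9M that du
reports for .git/logs.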
In my case 704K went into the index (not directly visible above, it's
the bulk of the top level). That's also not unreasonable: merely
storing the object SHA1 (20 bytes) and a bunch of timestamps for 10000
files also gets you into the 500K ballpark.
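The index number can be checked a bit more closely. The sketch below
assumes the version 2 index format: a 12-byte header, a 20-byte trailing
checksum, and per entry 62 bytes of fixed fields (timestamps, stat data,
SHA1, flags) followed by the NUL-terminated path, padded to a multiple
of 8 bytes. Extensions such as the cached tree are ignored, and the
function name index_size is hypothetical.

```shell
#!/usr/bin/env bash
# Estimate the size of a version-2 index holding files named 1..N.
# Assumes no index extensions are present.
index_size() {
    local n=$1 i len
    local total=32                  # 12-byte header + 20-byte checksum
    for ((i = 1; i <= n; i++)); do
        len=${#i}                   # length of the file name, e.g. "10000" is 5
        # 62 fixed bytes + name + at least one NUL, rounded up to 8 bytes
        total=$(( total + ((62 + len + 8) & ~7) ))
    done
    echo "$total"
}

index_size 10000
```

Under those assumptions the estimate is 719960 bytes, roughly 703K,
which is in line with the 704K observed.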
The pack index amazingly takes only about 500K, even though it is
indexing 10000 trees and 10000 commits; again, the SHA1s alone
(20000 * 20 bytes) get you into the 400K ballpark.
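The same kind of estimate works for the pack index. This sketch assumes
a version 2 .idx file without large-offset entries: an 8-byte header, a
256-entry fan-out table, and per object a 20-byte SHA1, a 4-byte CRC32
and a 4-byte pack offset, followed by two 20-byte checksums. The object
count of 20001 assumes one shared blob for the empty files on top of the
trees and commits; pack_idx_size is a hypothetical name.

```shell
#!/usr/bin/env bash
# Estimate the size of a version-2 pack index for N objects.
# Assumes no 64-bit offset table is needed (pack smaller than 2GB).
pack_idx_size() {
    local n=$1
    # 8-byte header + 256 * 4-byte fan-out + 40 bytes of trailing checksums
    local fixed=$(( 8 + 256 * 4 + 40 ))
    # per object: 20-byte SHA1 + 4-byte CRC32 + 4-byte pack offset
    echo $(( fixed + n * (20 + 4 + 4) ))
}

# 10000 commits + 10000 trees + 1 empty blob
pack_idx_size 20001
```

That comes to 561100 bytes, about 548K, so the "about 500K" figure is
mostly SHA1s plus 8 bytes of bookkeeping per object.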
That leaves only 2.3MB for the actual pack (which contains all the
data!). But every commit must store a tree and a parent, so there are
at least 2*10000*20 = 400K incompressible bytes in the commits
already[*]. So we are within a factor of 6 of just the data required to
save the shape of your history DAG, no content included. I'd say that's
not too bad.
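For completeness, the lower bound and the resulting factor can be
spelled out as below. The pack size of 2300000 bytes is an assumption
(2.3MB read as decimal megabytes), and both function names are
hypothetical.

```shell
#!/usr/bin/env bash
# Lower bound on commit data: one tree SHA1 and one parent SHA1
# (20 bytes each) per commit.
dag_lower_bound() {
    echo $(( $1 * 2 * 20 ))
}

# Ratio of pack size to lower bound, in tenths (integer arithmetic only).
ratio_tenths() {
    echo $(( $1 * 10 / $2 ))
}

bound=$(dag_lower_bound 10000)
echo "lower bound: $bound bytes"
echo "pack/bound ratio in tenths: $(ratio_tenths 2300000 "$bound")"
```

A ratio of 57 tenths, i.e. about 5.7, is where the "factor of 6" comes
from.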
[*] This is not quite true, the parents and trees might be pointers
within the pack. AFAIK the proposed pack v4 format does this, and would
yield a more efficient compression. So if you're going to waste energy
worrying about this, you should help with pack v4.
--
Thomas Rast
trast@{inf,student}.ethz.ch