From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Subject: Re: [PATCH 0/7] textconv caching
Date: Fri, 2 Apr 2010 02:14:21 -0400
Message-ID: <20100402061420.GA5551@coredump.intra.peff.net>
In-Reply-To: <20100402000159.GA15101@coredump.intra.peff.net>
On Thu, Apr 01, 2010 at 08:01:59PM -0400, Jeff King wrote:
> [before]
> $ time git show >/dev/null
> real 0m13.724s
> user 0m12.057s
> sys 0m1.624s
>
> [after (with cache primed)]
> $ time git show >/dev/null
> real 0m0.009s
> user 0m0.004s
> sys 0m0.004s
Since this is a space-time tradeoff, I thought it would make sense to
show some size numbers as a followup.
To get a sense of the size of the repo (it's almost all photos and
videos):
[size of the repo, already fully packed]
$ du -sh .git/objects
4.0G .git/objects
[the number of unique blobs through all history; most are binary media]
$ git log --raw --no-abbrev | awk '/^:/ {print $3 "\n" $4}' | sort -u | wc -l
10605
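To unpack the pipeline above: each `--raw` line starts with a colon and carries the old mode, new mode, old blob id and new blob id, so awk's $3 and $4 are the two ids. Below is a minimal sketch of the same count wrapped in a function. The name count_unique_blobs is made up for illustration, and the extra grep is an addition that is not in the original command: it drops the all-zero id that git prints for the missing side of an added or deleted file, which the original pipeline would otherwise count as one more entry.

```shell
#!/bin/sh
# Count distinct blob ids in `git log --raw --no-abbrev` style input on stdin.
# count_unique_blobs is a hypothetical helper name, not part of git.
count_unique_blobs() {
    awk '/^:/ { print $3; print $4 }' |   # old id and new id of every raw line
        grep -v '^0*$' |                  # assumption: skip the all-zero "no blob" id
        sort -u |
        wc -l |
        tr -d ' '                         # some wc implementations pad with spaces
}
```

Under those assumptions it would be used as `git log --raw --no-abbrev | count_unique_blobs`.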
In comparison, the metadata for a given file (produced by the textconv)
is about 200 bytes of text.
So I did a big cache priming:
$ time git log -p >/dev/null
real 39m29.748s
user 23m1.090s
sys 3m46.642s
Slow, and unsurprisingly spends quite a bit of time waiting on I/O. The
result is a notes tree with almost one textconv per blob:
$ git ls-tree -r notes/textconv/mfo | wc -l
10317
We're now using almost 200M:
$ git count-objects
39513 objects, 198604 kilobytes
But wait. Many of those objects are trees for stale versions of the
cache.
$ git repack -d
$ (cd .git/objects/pack && du -k *.pack)
2056 pack-34170e72ec40a07e98aae044479abccc9e02751b.pack
4089224 pack-81797628f3aebf6a0bdc082fa05ec14932910534.pack
$ git count-objects
30685 objects, 163288 kilobytes
In actuality, a fully packed cache is only about 2M (from 35M of
loose objects; it deltas quite well because there is a lot of overlap
in my metadata). And we can prune away the other 160M of cruft:
$ git prune
$ git count-objects
0 objects, 0 kilobytes
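The size figures quoted around these commands can be re-derived from the two `git count-objects` outputs. The sketch below only redoes that arithmetic; the variable names are made up for illustration, and it assumes that everything `git repack -d` left behind as loose objects was unreachable.

```shell
#!/bin/sh
# Figures copied from the `git count-objects` output in this message (kilobytes).
loose_after_priming=198604   # right after the cache-priming `git log -p`
loose_after_repack=163288    # still loose after `git repack -d`

# Loose objects that were reachable and therefore moved into the new pack.
reachable_loose=$((loose_after_priming - loose_after_repack))

# Size of the leftovers relative to the reachable cache, as a percentage.
overhead_pct=$((loose_after_repack * 100 / reachable_loose))

echo "reachable loose cache: ${reachable_loose} kilobytes"
echo "stale leftovers: ${loose_after_repack} kilobytes (${overhead_pct}% overhead)"
```

That gives roughly 35M of reachable loose objects and about 160M of leftovers, in line with the rounded figures in the text.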
And of course, the final speed result:
$ time git log -p >/dev/null
real 0m7.606s
user 0m6.084s
sys 0m0.788s
So what I take away from this is two things:
  1. The size tradeoff is definitely worthwhile for some workloads. In
     this case, the textconv version is orders of magnitude smaller than
     the original. I'd be interested to see numbers for something like a
     repository of documents that get textconv'd to pure ASCII.
  2. We had 460% loose object overhead just from tree objects in
     intermediate versions of the cache. While it was not too hard to
     get rid of with a repack and a prune, we are probably better off
     not generating it in the first place. In theory we could have
     written only one notes tree, and kept the intermediate state in
     memory. In practice, flushing once per commit-diff (instead of once
     per file) would probably be fine, and would be simpler to
     implement.
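To make the flushing options in point 2 concrete, here is a rough counting sketch. It is a simplified model, not the notes-cache code: it assumes every listed file is a cache miss and that each flush writes one new version of the notes tree (a real flush would also write fanout subtrees and a commit). The function name count_flushes and the input numbers are hypothetical.

```shell
#!/bin/sh
# Input: one integer per line, the number of newly converted (cache-miss)
# files in each commit's diff. Output: how many times the notes tree would
# be written under three flush policies. count_flushes is a made-up name.
count_flushes() {
    awk '
        { per_file += $1; if ($1 > 0) per_commit++ }
        END {
            once = (per_file > 0) ? 1 : 0   # keep everything in memory, write at exit
            printf "per-file=%d per-commit=%d once=%d\n", per_file, per_commit, once
        }'
}
```

For example, `printf '3\n0\n2\n' | count_flushes` models three commits whose diffs touch 3, 0 and 2 uncached files. In this model the per-commit policy never writes more trees than the per-file one, and every intermediate tree it avoids is one less stale object to prune later.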
And of course, now that I have a completely primed cache, I can push it
around with "git push $dest notes/textconv/mfo". Yay for storing notes
as git objects.
-Peff