From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Subject: Re: [PATCH 0/7] textconv caching
Date: Fri, 2 Apr 2010 02:14:21 -0400
Message-ID: <20100402061420.GA5551@coredump.intra.peff.net>
In-Reply-To: <20100402000159.GA15101@coredump.intra.peff.net>

On Thu, Apr 01, 2010 at 08:01:59PM -0400, Jeff King wrote:

>   [before]
>   $ time git show >/dev/null
>   real    0m13.724s
>   user    0m12.057s
>   sys     0m1.624s
> 
>   [after (with cache primed)]
>   $ time git show >/dev/null
>   real    0m0.009s
>   user    0m0.004s
>   sys     0m0.004s

Since this is a space-time tradeoff, I thought it would make sense to
show some size numbers as a followup.

To get a sense of the size of the repo (it's almost all photos and
videos):

  [size of the repo, already fully packed]
  $ du -sh .git/objects
  4.0G    .git/objects

  [the number of unique blobs through all history; most are binary media]
  $ git log --raw --no-abbrev | awk '/^:/ {print $3 "\n" $4}' | sort -u | wc -l
  10605

In comparison, the metadata for a given file (produced by the textconv)
is about 200 bytes of text.

So I did a big cache priming:

  $ time git log -p >/dev/null
  real    39m29.748s
  user    23m1.090s
  sys     3m46.642s

Slow, and unsurprisingly spends quite a bit of time waiting on I/O. The
result is a notes tree with almost one textconv per blob:

  $ git ls-tree -r notes/textconv/mfo | wc -l
  10317

We're now using almost 200M of loose objects:

  $ git count-objects
  39513 objects, 198604 kilobytes

But wait. Many of those objects are trees for stale versions of the
cache.

  $ git repack -d
  $ (cd .git/objects/pack && du -k *.pack)
  2056    pack-34170e72ec40a07e98aae044479abccc9e02751b.pack
  4089224 pack-81797628f3aebf6a0bdc082fa05ec14932910534.pack
  $ git count-objects
  30685 objects, 163288 kilobytes

In actuality, a fully packed cache is only about 2M (down from 35M of
loose objects; it deltas quite well because there is a lot of overlap
in my metadata). And we can prune away the other 160M of cruft:

  $ git prune
  $ git count-objects
  0 objects, 0 kilobytes

And of course, the final speed result:

  $ time git log -p >/dev/null
  real    0m7.606s
  user    0m6.084s
  sys     0m0.788s

So what I take away from this is two things:

  1. The size tradeoff is definitely worthwhile for some workloads. In
     this case, the textconv version is orders of magnitude smaller than
     the original. I'd be interested to see numbers for something like a
     repository of documents that get textconv'd to pure ASCII.

  2. We had 460% loose object overhead just from tree objects in
     intermediate versions of the cache. While it was not too hard to
     get rid of with a repack and a prune, we are probably better off
     not generating it in the first place. In theory we could have
     written only one notes tree, and kept the intermediate state in
     memory. In practice, flushing once per commit-diff (instead of once
     per file) would probably be fine, and would be simpler to
     implement.
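
For reference, the 460% figure can be reproduced from the two
"git count-objects" outputs above. This is only a sketch of the
arithmetic: the variable names are made up for illustration, and it
assumes that the loose objects which disappeared after "git repack -d"
are exactly the live cache, with everything left behind being cruft.

```shell
#!/bin/sh
# Figures copied from the "git count-objects" output in this message
# (in kilobytes, as reported by git).
loose_before_repack_kb=198604   # all loose objects after priming the cache
loose_after_repack_kb=163288    # loose objects left after "git repack -d"

# Assumption: the loose objects that the repack moved into the new pack
# are the live cache; what remains loose is unreachable cruft.
live_cache_kb=$((loose_before_repack_kb - loose_after_repack_kb))

# Cruft as a percentage of the live cache (integer division rounds down).
overhead_pct=$((loose_after_repack_kb * 100 / live_cache_kb))

echo "live cache: ${live_cache_kb} KB loose"
echo "cruft overhead: ${overhead_pct}%"
```

This prints 35316 KB for the live cache and 462% for the overhead,
which matches the "35M" and "460%" quoted above after rounding.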

And of course, now that I have a completely primed cache, I can push it
around with "git push $dest notes/textconv/mfo". Yay for storing notes
as git objects.

-Peff
