From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Subject: [PATCH 0/3] fast textconv
Date: Sun, 28 Mar 2010 10:53:02 -0400
Message-ID: <20100328145301.GA26213@coredump.intra.peff.net>
The normal textconv procedure is to dump the binary file to a tempfile
(optionally using a working tree file if available), then run the
textconv helper to produce a textual version on stdout. This is a very
convenient interface, as helpers don't need to be aware of git at all
and many standard commands can be used without wrappers.
Unfortunately, it can be slow for large binary files. We spool the file
to disk before invoking the textconv helper, so the helper has no way to
do any optimizations. For example, the helper may need only part of the
file (e.g., when showing metadata at the beginning of a media file), or
it may implement a caching scheme to avoid repeating expensive
conversions.
This series introduces a "fast textconv", which does not automatically
spool a tempfile, but instead gives the helper program the sha1 of the
blob to be converted.
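
To make the idea concrete, here is a rough sketch of what a caching helper for this interface could look like. It is not the perl helper used for the timings in this message. It assumes the blob sha1 arrives as the helper's first command-line argument (this message does not spell out how it is passed), and the cache directory and the convert() stand-in are made up for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical caching textconv helper (sketch, not the real helper)."""
import os
import subprocess
import sys

# Hypothetical cache location: one file per blob sha1.
CACHE_DIR = os.path.expanduser("~/.cache/textconv-demo")


def read_blob(sha1):
    """Fetch the raw blob contents from the object database."""
    return subprocess.run(
        ["git", "cat-file", "blob", sha1], check=True, stdout=subprocess.PIPE
    ).stdout


def convert(data):
    """Stand-in for the expensive conversion (e.g. dumping exif tags)."""
    return "size: %d bytes\n" % len(data)


def textconv(sha1, cache_dir=CACHE_DIR, fetch=read_blob):
    """Return the text for a blob, converting only on a cache miss."""
    path = os.path.join(cache_dir, sha1)
    try:
        with open(path) as f:
            return f.read()  # cache hit: the blob is never fetched
    except FileNotFoundError:
        pass
    text = convert(fetch(sha1))
    os.makedirs(cache_dir, exist_ok=True)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(text)
    os.replace(tmp, path)  # rename into place so readers never see a partial entry
    return text


# Assumption: the sha1 is passed as the only argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.write(textconv(sys.argv[1]))
```

Because the blob is identified by its sha1, the cache key is free and a hit costs one small file read, with no blob contents touched at all.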
Here are some timings from my photo repository, on a commit with 37
JPEGs and 8 AVIs. Each file had two lines added to its exif metadata.
My textconv helper is a perl script that dumps the exif tags, and
implements its own caching scheme.

  $ time git show >/dev/null ;# before patch
  real    0m13.818s
  user    0m12.137s
  sys     0m1.552s

  $ time git show >/dev/null ;# after patch, first run
  real    0m15.076s
  user    0m13.321s
  sys     0m1.772s

  $ time git show >/dev/null ;# after patch, subsequent runs
  real    0m2.502s
  user    0m1.820s
  sys     0m0.592s

So you can see a 5.5x speedup once the helper's cache is warm (13.8s
down to 2.5s). The first run is a little bit slower, presumably due to
the extra git-cat-file calls by the helper.
The speedup is purely from caching; I am not using the "we only need to
read the first part of the file" optimization. My files are only a few
megabytes. Probably that would be more useful for people storing files
in the hundreds of megabytes, where a full cat-file will cause a lot of
unwanted I/O.
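
For what it's worth, the "only read the first part" optimization could be sketched like this. This is an illustration, not something measured here: the function name and the cmd parameter are hypothetical, and how much work git itself skips when the pipe is closed early is not something this sketch demonstrates.

```python
import subprocess


def read_blob_prefix(sha1, nbytes, cmd=("git", "cat-file", "blob")):
    """Return at most the first nbytes of a blob, then stop the producer.

    cmd is a parameter only so the sketch can be exercised without a
    repository; a real helper would always run git cat-file.
    """
    proc = subprocess.Popen(list(cmd) + [sha1], stdout=subprocess.PIPE)
    try:
        # read() on the buffered pipe returns once nbytes have arrived
        # or the producer reaches EOF, whichever comes first.
        return proc.stdout.read(nbytes)
    finally:
        proc.stdout.close()  # we do not want the rest of the blob
        proc.kill()          # stop the producer if it is still writing
        proc.wait()          # reap the child so no zombie is left behind
```

A helper that only needs metadata from the start of a media file would call something like read_blob_prefix(sha1, 65536) and parse just that prefix; the 64k figure is arbitrary.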
There are two things I'm still not 100% happy with:

  1. 2.5 seconds is still a little slower than I would like. The slowness
     comes from the fact that my helper is written in perl, and therefore
     perl gets invoked for each diff. I could try collecting all of the
     to-be-textconv'd files at the beginning of the diff process and just
     invoking the helper once. But that means we need to store the
     results in core, and they could potentially be long (in my case,
     they are only a few hundred bytes, but somebody could potentially be
     textconv'ing large documents).

  2. It is up to the helper to implement a caching layer. This offers a
     lot of flexibility, but it means each helper must implement its own.
     It also means we have to run the helper even for a cache hit, which
     causes slowness.

     An alternative would be for git to support textconv caching
     natively, probably by using the notes mechanism to map blob sha1s
     to their textconv'd contents. But that opens a whole can of worms
     with how the cache is managed. If I change my textconv helper to
     produce different results, how do I invalidate the cache? Should it
     happen automatically if I change the contents of
     diff.$method.textconv? Or do I need to do it manually? (You will
     still need to do it manually if, e.g., you upgrade your textconv
     helper; git can't know about that.) How do I evict entries if the
     cache gets too large when notes are stored as a history?

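
One possible answer to the "invalidate automatically when diff.$method.textconv changes" question, sketched under assumptions: derive the cache key from both the blob sha1 and the configured command string, so that editing the config simply makes the old entries unreachable. This is not how any existing git code works; the names are hypothetical and a plain dict stands in for the notes-based storage. It does not detect an upgraded helper binary, and it does nothing about eviction.

```python
import hashlib


def cache_key(blob_sha1, textconv_cmd):
    """Mix the configured helper command into the key for a blob."""
    h = hashlib.sha1()
    h.update(textconv_cmd.encode("utf-8"))
    h.update(b"\0")  # separator so the two fields cannot run together
    h.update(blob_sha1.encode("ascii"))
    return h.hexdigest()


class TextconvCache:
    """In-memory stand-in for a persistent store such as notes."""

    def __init__(self):
        self._entries = {}

    def get(self, blob_sha1, textconv_cmd):
        # Returns None on a miss, including after the command has changed.
        return self._entries.get(cache_key(blob_sha1, textconv_cmd))

    def put(self, blob_sha1, textconv_cmd, text):
        self._entries[cache_key(blob_sha1, textconv_cmd)] = text


cache = TextconvCache()
blob = "0" * 40  # placeholder sha1, not a real object
cache.put(blob, "exif-dump", "Model: example\n")
hit = cache.get(blob, "exif-dump")
miss = cache.get(blob, "exif-dump --verbose")  # changed config: no stale result
```

The stale entries are never removed in this scheme, which is exactly the eviction problem raised above.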
So I'm not sure. This series works and is simple from git's perspective.
But caching textconv results in notes would be faster, and would make
helper scripts easier to write.
The patches are:
[1/3]: textconv: refactor calls to run_textconv
[2/3]: textconv: refactor to handle multiple textconv types
[3/3]: diff: add "fasttextconv" config option
-Peff